I have a couple of questions about using the Sample node in SPSS Modeler v16:
1. When I use the Sample node's 'Random %' option on the telco.sav dataset and set the percentage to 30, the result is not exactly 300 records, but something like 293 or even 320. Is this expected behavior? If so, is there a way to obtain an exact number of records from a dataset?
2. Could someone help me understand in which scenarios 1-in-n sampling is most useful?
Yes - this is expected behavior as the sampling is done at random.
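To illustrate why the count varies, here is a minimal Python sketch. It assumes that 'Random %' behaves like an independent coin flip per record, which is my reading of the behavior described in the question and not a statement about Modeler's internals. The function name `random_percent_sample` and the 1000-record stand-in for telco.sav are made up for the example.

```python
import random

def random_percent_sample(records, percent, rng):
    """Keep each record independently with probability percent/100 (assumed behavior)."""
    p = percent / 100.0
    return [r for r in records if rng.random() < p]

rng = random.Random(0)          # fixed seed so the run is repeatable
records = list(range(1000))     # stand-in for a 1000-record dataset

# Draw the "30%" sample five times and look at how many records each draw keeps.
sizes = [len(random_percent_sample(records, 30, rng)) for _ in range(5)]
print(sizes)
```

Under that assumption the sample size follows a binomial distribution with mean 300 and standard deviation sqrt(1000 × 0.3 × 0.7) ≈ 14.5, so counts like 293 or 320 are well within the normal range.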
If you need an exact number of records, this can be achieved in several different ways, but I prefer to change the "Sample method" to Complex and then select:
You can then specify the exact number of records in your sample.
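For comparison, drawing a fixed number of records without replacement always returns exactly that many. This is only a Python sketch of the idea, not how Modeler's Complex sampling is implemented; `exact_count_sample` is a hypothetical helper name.

```python
import random

def exact_count_sample(records, count, rng):
    """Draw exactly `count` distinct records, without replacement."""
    return rng.sample(records, count)

rng = random.Random(0)
records = list(range(1000))     # stand-in for a 1000-record dataset

sample = exact_count_sample(records, 300, rng)
print(len(sample))  # 300
```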
Historically, systematic (every nth) sampling had some advantages:
1) It avoided generating random numbers, which used to be a pen-and-paper job.
2) It ensures more even coverage of the dataset.
3) Unless the data are already in random order, systematic sampling can improve precision compared to a simple random sample.
In practice today, 1 is trivial, and for 2 and 3 you can use the more recently added complex sampling options.
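The coverage point can be seen in a small Python sketch. It assumes the records are already ordered by some attribute (here just the integers 0 to 999), and `systematic_sample` is an illustrative name, not a Modeler function.

```python
import random

def systematic_sample(records, n, rng):
    """Take every n-th record, starting from a random offset in [0, n)."""
    start = rng.randrange(n)
    return records[start::n]

rng = random.Random(0)
records = list(range(1000))     # assumed to be sorted by some attribute

sample = systematic_sample(records, 10, rng)

# Count how many sampled records fall in each consecutive block of 100.
per_block = [sum(1 for r in sample if lo <= r < lo + 100)
             for lo in range(0, 1000, 100)]
print(per_block)  # [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
```

Every block of 100 contributes exactly 10 records regardless of the starting offset, whereas a simple random sample of 100 would usually leave some blocks over- or under-represented.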
Use cases are now less obvious, but I guess some obscure ones might include:
1) Where you want to model a process that behaves as 1-in-n in the real world.
2) Where you want to demonstrate the dangers of using 1-in-n sampling.
3) Where you are working with streaming data, i.e. selecting entities in real time, e.g. recording every 10th inbound communication in a call centre.
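The streaming case is the one where 1-in-n is hard to replace, because the total number of records is not known in advance. A rough Python sketch, with `every_nth` and the `call-…` identifiers invented for illustration:

```python
from itertools import islice

def every_nth(stream, n):
    """Yield every n-th item of an iterable without needing to know its length."""
    return islice(stream, n - 1, None, n)

# Simulated stream of 30 inbound calls, consumed one at a time.
calls = (f"call-{i}" for i in range(1, 31))

recorded = list(every_nth(calls, 10))
print(recorded)  # ['call-10', 'call-20', 'call-30']
```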