A while ago I wrote a post about generating fake data in SPSS. This is useful for conducting your own simulations or, as I mentioned in the prior post, for posting questions to discussion lists with example data that demonstrates your problem.

As of version 21, SPSS includes a new simulation procedure with which you can take a predictive model and simulate new outcomes (it is in Base, so everyone with SPSS 21 or later has it). You can also use it to create data from scratch, with its own distribution for each variable, and have the data conform to either an approximate correlation matrix or a contingency table. Below is an example set of syntax that creates a simulation plan using the `SIMPLAN` command.

```
FILE HANDLE save /NAME = "!your handle here!".
SIMPLAN CREATE
 /SIMINPUT
  INPUT = input1(FORMAT=F)
  TYPE = MANUAL
  DISTRIBUTION = POISSON(MEAN=1.2)
 /SIMINPUT
  INPUT = input2 input3
  TYPE = MANUAL
  DISTRIBUTION = NORMAL(MEAN = 0 STDDEV = 1)
 /CORRELATIONS
  VARORDER = input1 input2 input3
  CORRMATRIX = 1; 0.5, 1; 0.2 0.1 1
 /STOPCRITERIA MAXCASES=50000
 /PLAN FILE='savetest.splan'.
```

First I create a file handle named `save` that ends up being where I save the `splan` file. The `SIMPLAN` command is quite complex, but here is a quick rundown of what is going on in this particular statement:

• On the `SIMINPUT` subcommand, you specify a variable to create, its format, and its marginal distribution. Note that numeric formats on this command differ from the usual form: instead of `F3.0`, you give just the format type, and if you want decimals you add them after a comma, e.g. `F,2` for two decimals.
• I have two separate `SIMINPUT` subcommands. The first specifies `input1` as Poisson distributed with a mean of 1.2; the second specifies `input2` and `input3` as normally distributed with a mean of zero and a standard deviation of 1.
• The `CORRELATIONS` subcommand then lets you specify a set of approximate bivariate correlations between the variables, entered here as the lower triangle of the matrix in the order given by `VARORDER`.
• The `STOPCRITERIA` subcommand then caps the simulation at 50,000 cases.
• The `PLAN` subcommand then specifies a file to save the simulation plan to.
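As a small illustration of the format note in the list above, here is a minimal sketch of a plan with a single normal input displayed with two decimals. It assumes the `F,2` form goes inside the `FORMAT=` parentheses in the same way `F` does in the example above. The file name `format_example.splan` and the smaller `MAXCASES` value are made up for this sketch.

```spss
* Hypothetical one-variable plan; only the FORMAT specification differs from the earlier example.
SIMPLAN CREATE
 /SIMINPUT
  INPUT = input2(FORMAT=F,2)
  TYPE = MANUAL
  DISTRIBUTION = NORMAL(MEAN = 0 STDDEV = 1)
 /STOPCRITERIA MAXCASES=1000
 /PLAN FILE='format_example.splan'.
```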

Once you have the simulation plan created, you can then run the `SIMRUN` command to generate the data.

```
SIMRUN
 /PLAN FILE='savetest.splan'
 /CRITERIA REPRESULTS=TRUE SEED=10
 /PRINT ASSOCIATIONS=YES DESCRIPTIVES=YES
 /OUTFILE FILE=*.
```

Note that in this code snippet I save the simulated data to the active dataset, a feature added in V22. In earlier versions you need to use `DATASET DECLARE` and then specify that dataset name on the `OUTFILE` subcommand (or, I presume, save to a file).
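For versions before V22, the `DATASET DECLARE` route might look like the sketch below. The dataset name `simdata` is a placeholder I made up, and the `DESCRIPTIVES` and `CORRELATIONS` commands at the end are just one way to check the simulated data against the plan: a mean near 1.2 for `input1`, means near 0 and standard deviations near 1 for the normal inputs, and correlations roughly matching the requested matrix.

```spss
* Declare a dataset name first, then send the simulated cases to it.
DATASET DECLARE simdata.
SIMRUN
 /PLAN FILE='savetest.splan'
 /CRITERIA REPRESULTS=TRUE SEED=10
 /PRINT ASSOCIATIONS=YES DESCRIPTIVES=YES
 /OUTFILE FILE=simdata.
DATASET ACTIVATE simdata.

* Compare the simulated data with the targets in the plan.
DESCRIPTIVES VARIABLES=input1 input2 input3
 /STATISTICS=MEAN STDDEV MIN MAX.
CORRELATIONS /VARIABLES=input1 input2 input3.
```

Because the correlations in the plan are only approximate targets, expect the observed values to be close to, not identical to, the requested ones.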

Using the simulation builder is probably overkill for generating data to troubleshoot problems on a listserv, although if you use its capabilities to mimic your current dataset it can be a quick way to generate fake data that you can upload to a public site without divulging confidential info. This is certainly just scratching the surface of what the simulation builder can do, though. I’m hoping its functionality will continue to be extended in the future; in particular, I would like to see more complicated transformations allowed on the `SIMPREP` command.

2 comments on "Using SPSS’s SIMPLAN to generate fake data"

1. Most people would probably want to start with the Simulation dialogs as the procedures are pretty complex. For replicating the active dataset distributions where you don’t want the actual data to appear, there is a simplified dialog as an extension command named Simulate Active Data (Data > Simulate Active Data) available for V21 and 22. It can be downloaded from the SPSS Community site (www.ibm.com/developerworks/spssdevcentral) in the Extension Commands collection, or, for V22 it can be installed using Utilities > Extension Bundles > Download and Install Extension Bundles.

• Just checked out your pared-down dialog; very nice. The main simulation builder dialog is very complicated, by necessity, as it is a complicated procedure (ditto for many of the more complicated regression commands like GENLIN or MIXED).