Recently this problem was posed on the SPSSX-L listserv (linked on the SPSS Community site): Count the number of distinct values in a set of variables for each case. This led to a lively discussion of alternative solutions. Most used traditional syntax. I used Python programmability.
Datasets sometimes need to be restructured between wide form, where multiple measurements of a concept are expressed as different variables within a case, and long form, where each repeated measurement is a separate case. Repeated measures, such as test scores for a subject at different points in time, are commonly stored in wide form, and IBM SPSS Statistics procedures that focus on repeated measures designs tend to require this form. Most data transformations and simple summary statistics, however, are intended for long form. Thus an easy way to convert between these forms is important.
IBM SPSS Statistics provides the commands CASESTOVARS and VARSTOCASES to convert between these forms. These are used by the Restructure Data Wizard, which appears on the Data menu.
SPSS Statistics has a number of transformation functions that work across variables within a case, such as mean, median, and sum, and the COUNT command that counts occurrences of a particular value, but there is no built-in function for counting distinct values. By converting these data from wide form to long form, the problem can be solved with traditional syntax.
The traditional syntax solution, then, has these steps (mainly worked out by David Marso). The syntax can all be generated using the GUI.
- Save the dataset if there have been changes.
- Use VARSTOCASES to replace the active dataset with one containing an id variable (from the original data or generated) and one variable representing the variables over which the calculation will be done. Call that new variable Z. The number of records is M * NV, where M is the number of IDs and NV is the number of variables over which the count is required.
- Use AGGREGATE with the ID variable and Z as the break variables. Use N as the statistic. Now the number of records in this new dataset is M * average number of distinct values.
- Activate the new dataset and use AGGREGATE again on it, breaking just on ID, and use N as the statistic. This results in one record per ID with the count of distinct values. The resulting dataset has M records.
- Get the original dataset, and use MATCH FILES to add this value back to it.
This works, but it is quite a few data passes, and it takes some study to understand the code.
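To make the record counts in those steps concrete, here is a rough pure-Python analogue of the restructure-and-aggregate approach. It is only a sketch: the data, the variable names (id, x1 to x3), and n_distinct are made up for illustration, and it mimics the logic of the steps rather than the Statistics commands themselves.

```python
from collections import Counter

# Hypothetical wide-form data: M = 3 cases, NV = 3 variables to examine.
wide = [
    {"id": 1, "x1": 1, "x2": 2, "x3": 1},
    {"id": 2, "x1": 5, "x2": 5, "x3": 5},
    {"id": 3, "x1": 7, "x2": 8, "x3": 9},
]
value_vars = ["x1", "x2", "x3"]

# VARSTOCASES analogue: one (id, Z) record per case and variable, M * NV in all.
long_form = [(row["id"], row[name]) for row in wide for name in value_vars]

# First AGGREGATE analogue: break on id and Z, one record per distinct pair.
pair_counts = Counter(long_form)

# Second AGGREGATE analogue: break on id only and count the records.
distinct_per_id = Counter(case_id for case_id, _ in pair_counts)

# MATCH FILES analogue: attach the count to the original wide-form cases.
result = [dict(row, n_distinct=distinct_per_id[row["id"]]) for row in wide]

for row in result:
    print(row["id"], row["n_distinct"])
```

With these made-up values the long form has 9 records, the first aggregation leaves 6, and the final result has one count per case.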
The programmability solution is much simpler. It uses the SPSSINC TRANS extension command along with a two-line Python program to arrive at the same result. Here is the Python program followed by the command syntax.
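The listing itself is not reproduced in this copy of the post. A minimal sketch that is consistent with the explanation that follows, though not necessarily identical to the original listing, would be a function like this; inside Statistics it would sit between BEGIN PROGRAM and END PROGRAM.

```python
def countThem(*args):
    # args collects every value passed in; a set keeps one copy of each.
    return len(set(args))


# Quick check outside of Statistics with plain values.
print(countThem(1, 2, 1))
```

The matching command would then take a form along the lines of SPSSINC TRANS RESULT=ndistinct /FORMULA "countThem(x1, x2, x3)"., where ndistinct and x1 to x3 are hypothetical variable names; check the command's syntax help for the exact options.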
The program explanation
- The "*args" signature in the countThem function means that args will be a tuple of the arguments (variable values in this case) passed to the function when it is called, so the function can handle any number of variables.
- set(args) creates a set from that tuple. Since a set can only contain the same item once, it will contain only the distinct values. If I pass it the values 1, 2, 1, the set will contain two members: 1 and 2.
- The length of the set, returned by the len function, is the number of members, and hence the return value is the number of distinct values. System-missing (SYSMIS) values arrive as None, so None counts as one distinct value if any are present. None could easily be excluded if missing values should not be counted.
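If missing values should not be counted, one possible variant filters out None before building the set. This is a sketch under the assumption stated above that SYSMIS values reach the function as None; the name countValid is made up for illustration.

```python
def countValid(*args):
    # Build the set only from values that are not None (i.e., not SYSMIS).
    return len({value for value in args if value is not None})


# A case with one missing value: only the value 1 is counted.
print(countValid(1, None, 1))
```

A case whose values are all missing would get a count of 0 with this variant.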
The SPSSINC TRANS explanation
- The program defined in the begin program block remains available throughout the session even though the program block has terminated, so once it is defined, it can be called elsewhere in that session.
- SPSSINC TRANS is an extension command implemented in Python that applies the code in the FORMULA subcommand to the cases in the active dataset and stores the result in the RESULT variable. It fetches the variable values referenced in the formula and calls whatever function was specified.
- Since countThem is called without any module qualifier, the extension command first tries to find that function among the items that have been defined in begin program blocks. If it is not found there, the command looks for a function built into Python (e.g., min, max, sum, …). If you have a function defined in some other module, you can reference it as modulename.funcname, and the command will load that module and then call the indicated function.
- The entire formula is quoted so that it will not be digested by the Statistics parser but rather passed as is to the command. Variable names in the formula are case sensitive, unlike in ordinary Statistics syntax.
- SPSSINC TRANS can return more than one value, hence creating multiple variables, and it has a number of other features that can be found in its syntax or dialog box help (Transform>Programmability Transformation).
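The function lookup order described in the list above can be pictured with a small sketch. This is purely illustrative and is not the extension command's actual code; resolve_function and user_namespace are hypothetical names, and the dictionary stands in for whatever has been defined in begin program blocks.

```python
import builtins
import importlib


def resolve_function(name, user_namespace):
    """Find a callable by name, mirroring the lookup order described above."""
    if "." in name:
        # modulename.funcname: load the module, then fetch the function.
        module_name, func_name = name.rsplit(".", 1)
        module = importlib.import_module(module_name)
        return getattr(module, func_name)
    if name in user_namespace:
        # Functions defined earlier in the session take precedence.
        return user_namespace[name]
    # Otherwise fall back to Python's built-in functions (min, max, sum, ...).
    return getattr(builtins, name)


# Stand-in for the session's begin program definitions.
session = {"countThem": lambda *args: len(set(args))}

print(resolve_function("countThem", session)(1, 2, 1))
print(resolve_function("max", session)(1, 2, 1))
print(resolve_function("math.sqrt", session)(9.0))
```

An unqualified name that is neither user-defined nor a built-in raises AttributeError in this sketch; how the real command reports that case is not covered here.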
So which is better? The traditional syntax does not require any programmability knowledge and works on quite elderly versions of Statistics, but it is rather complicated and takes many data passes. The Python approach requires a little knowledge of programmability and requires at least version 17 (2008) and the Python plugin, but, given that, it is easy to read and takes only one data pass. That data pass, however, will be slower than a native data pass.
The Python Essentials and the SPSSINC TRANS extension command can be downloaded from the SPSS Community site (www.ibm.com/developerworks/spssdevcentral if you are not reading this on the site).