By R K Sharath Kumar | Updated August 11, 2018 - Published March 23, 2018
As a data mining application, IBM SPSS Modeler offers a strategic approach to finding useful relationships in large data sets. In contrast to more traditional statistical methods, you do not necessarily need to know what you are looking for when you start. You can explore your data, fitting different models and investigating different relationships, until you find useful information. This tutorial was tested on Windows 7 using IBM SPSS Modeler v18.1.
Upon completing this tutorial, you will know how to read data into IBM SPSS Modeler, audit and clean it, build predictive models with the Auto Numeric node, and evaluate and export the results.
It should take approximately 30 minutes to complete this tutorial.
Working with IBM SPSS Modeler is a three-step process: you read data into Modeler, run it through a series of manipulations, and send it to a destination.
This sequence of operations is known as a data stream because the data flows record by record from the source, through each manipulation, and finally to the destination, which is either a model or a type of data output.
Proceed step by step. Nodes are selected, configured, and linked by right-clicking a node and choosing Connect to attach it to the next node in the stream.
This is the first step in the SPSS stream. Drag the Var. File node from the Sources palette onto the SPSS Modeler canvas. The Var. File node is used for reading delimited data such as CSV and text files.
The next action is to read the data into SPSS Modeler. Click the button to the right of the File field and navigate to the folder where the data file is saved. Click Open, and then click OK.
The third action is to select the Data Audit node from the Output palette.
The Data Audit node identifies how many valid records exist and reports basic statistics for each attribute. The screenshot below shows a total of 84,672 observations; Revenue and the attributes listed below it have only 24,743 valid records, and the remaining values are null.
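Outside of Modeler, the same audit amounts to counting valid (non-null) values per column. A minimal sketch in plain Python; the column names and values here are illustrative stand-ins, not taken from the actual data file:

```python
import csv
import io

# A tiny illustrative extract; the real file has 84,672 records.
sample = io.StringIO(
    "Date,Quantity,Revenue\n"
    "2018-01-02,5,120.50\n"
    "2018-01-03,3,\n"        # missing Revenue -> counted as a null
    "2018-01-04,7,89.99\n"
)

reader = csv.DictReader(sample)
rows = list(reader)

# Total records vs. records with a valid (non-empty) Revenue value.
total = len(rows)
valid_revenue = sum(1 for r in rows if r["Revenue"].strip() != "")
print(total, valid_revenue)  # 3 2
```

This is the same valid-versus-null breakdown the Data Audit node reports, just for one column.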
The next action is to replace the nulls with the mean value of each attribute. Select the Filler node from the Field Ops palette, then select the Set Globals node from the Output palette to compute the means of multiple attributes.
Replacing Nulls with Mean value.
The numerical columns should be real numbers. Here a new attribute named Quantity_New is created to convert the data type from integer to real number.
Replace the Nulls with Mean value for the newly created variable.
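The Filler/Set Globals combination boils down to computing each attribute's mean over the valid records and substituting it for the nulls, after converting integers to real numbers. A sketch of the same logic in plain Python, with made-up values (None stands in for a null):

```python
from statistics import mean

# Quantity as integers with missing values, as in the raw data.
quantity = [5, 3, None, 7, None, 4]

# Convert integer to real (the Quantity_New step in the tutorial).
quantity_new = [float(q) if q is not None else None for q in quantity]

# Mean over valid records only, as the Set Globals node computes it.
q_mean = mean(q for q in quantity_new if q is not None)

# Fill nulls with the mean, as the Filler node does.
quantity_filled = [q if q is not None else q_mean for q in quantity_new]
print(q_mean, quantity_filled)
```

After this step every record carries a real-valued Quantity, which is what the second Data Audit run confirms.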
Run the Data Audit again to check whether the nulls have been replaced by the mean values. Notice that all attributes except Quantity now have 84,672 valid records. Quantity will be replaced by the Quantity_New attribute for data analysis and modelling.
Next, select the input parameters and the target variable. Select the Type node from the Field Ops palette.
The Type node allows the selection of the input variables and the target variable. Make the selections as shown below; the categorical variables are ignored.
We need to partition the data using the recommended 70:30 split between training and testing data. Select the Partition node from the Field Ops palette.
The model will be built on the training data and tested on the testing data. Create the partition in the data.
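The 70:30 partition is simply a random split of the records. A minimal sketch in Python (the record values are placeholders, and the seed is chosen only to make the example reproducible):

```python
import random

random.seed(42)  # fixed seed so the split is reproducible

records = list(range(100))  # stand-ins for the data records
random.shuffle(records)

split = int(len(records) * 0.7)  # 70% for training
training, testing = records[:split], records[split:]
print(len(training), len(testing))  # 70 30
```

Every record lands in exactly one partition, which is what the Partition node guarantees as well.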
Use the Select node from the Record Ops palette to select the training data for model building.
Select the node and rename it Training_Data on the Annotations tab; the default name is Select.
From the Modelling palette, drag the Auto Numeric node onto the canvas. We choose the Auto Numeric node because we are predicting a continuous (numeric) variable; to predict a categorical variable, select the Auto Classifier node instead.
The name of the node defaults to the variable we are trying to predict, in this case Revenue. In this node we use the predefined roles under Fields, because the input and target variables were already selected in the Type node in the previous step.
We select the modelling parameters as shown below.
Different algorithms can be selected under the Expert tab of the Auto Numeric node. Then right-click the node and click Run.
SPSS creates three models for the prediction because we specified the number of models to use as 3 in the Auto Numeric node.
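Each candidate model is, at heart, a numeric predictor fitted to the training data. As a minimal illustration of the idea, not of Modeler's actual algorithms, here is a one-variable least-squares regression fitted by hand on made-up values:

```python
# Simple least-squares fit y = a + b*x; the data values are illustrative.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Slope and intercept from the closed-form least-squares solution.
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

def predict(x):
    return a + b * x

print(round(b, 2), round(a, 2))  # slope ~1.99, intercept ~0.09
```

Modeler's Auto Numeric node fits several such models (of different families) in parallel and keeps the best performers.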
Select the first model and click on Graph tab to view scatter plot and predictor importance.
Click on the Summary tab to identify the input/target variables and other details.
Use the Select node from the Record Ops palette to select the testing data for model testing and evaluation.
Select the node and rename it Testing_Data on the Annotations tab; the default name is Select.
Right-click the Testing_Data node, connect it to the model nugget, and click Run. Then select the Analysis node from the Output palette, connect it to the model nugget, and click Run.
Analyze the results. In this case the model is 100% accurate with zero errors, which is rarely the case in practice; model accuracy and errors depend on the data being used.
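The Analysis node's error figures amount to comparing predicted and actual values over the testing records. A sketch of that comparison with hypothetical numbers, mirroring the zero-error case reported above:

```python
# Hypothetical predicted and actual Revenue values for three test records.
preds  = [100.0, 250.0, 180.0]
actual = [100.0, 250.0, 180.0]  # identical here, so the error is zero

# Per-record errors and their mean absolute value.
errors = [p - a for p, a in zip(preds, actual)]
mean_abs_error = sum(abs(e) for e in errors) / len(errors)
print(mean_abs_error)  # 0.0
```

On real data the predicted and actual columns will differ, and this mean absolute error will be greater than zero.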
Select Table node from Output palette to export the results.
In the Table node, select the "Output to file" option to export the results to a CSV file. Select the radio button next to File name and provide the path the CSV file should be exported to. The output file has two additional attributes: $XR-Revenue is the predicted output, which is the average of each model's individual prediction, and $XRE-Revenue is the standard error of the predictions made by the ensemble of models.
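The two generated fields can be reproduced by hand: for each record, average the individual models' predictions to get the ensemble prediction, and measure the spread across models for the error field. A sketch with made-up predictions from three models; the spread is taken here as the sample standard deviation across models, and Modeler's exact formula for $XRE-Revenue may differ:

```python
from statistics import mean, stdev

# Hypothetical predictions for one record from the three models.
preds = [105.0, 110.0, 100.0]

xr_revenue = mean(preds)    # ensemble prediction, like $XR-Revenue
xre_revenue = stdev(preds)  # spread across models, like $XRE-Revenue
print(xr_revenue, xre_revenue)  # 105.0 5.0
```

A small spread across the models indicates that they agree on the prediction for that record.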
The complete flow of the stream is shown below.
This is an attempt to show the basic steps to create a statistical model. The steps can be further enhanced to suit different requirements.
Thank you for reading this tutorial; we hope you are now better versed in IBM SPSS Modeler 18.1 and eager to use it for your next data analysis project.