SPSS Analytics Toolkit Tutorial

Overview

Streams is a platform that enables real-time analytics of data in motion. The IBM SPSS family of products provides the ability to build predictive analytic models. The IBM SPSS Analytics Toolkit is for Streams developers who need to leverage the powerful predictive models in a real-time scoring environment. In this lab, you will be building Streams applications to use a predictive model to analyze cell characteristics from patients who are believed to be at risk of developing cancer.

The SPSS Analytics Toolkit

The SPSS Analytics Toolkit (com.ibm.spss.streams.analytics) contains Streams operators that integrate with the IBM SPSS Modeler and SPSS Collaboration and Deployment Services products to implement various aspects of SPSS Modeler predictive analytics in your Streams applications. The SPSS Analytics Toolkit is installed by the SPSS Modeler Solution Publisher product, which ships with SPSS Collaboration and Deployment Services release 5.0 and later.

This lab was developed using Streams 4.0.1 and SPSS Modeler Solution Publisher version 17. Older versions of both products should also work for this lab, provided they are compatible with each other.

The following operators are available in the SPSS Analytics Toolkit:

SPSSScoring operator – integrates with SPSS Modeler Solution Publisher to enable the scoring of your SPSS Modeler-designed predictive models in Streams applications
SPSSPublish operator – automates the ‘publish’ of a Modeler file’s scoring branch and summarizes the generated files so downstream operators can refresh their scoring implementation with the PIM, PAR and XML files created or updated by the ‘publish’ operation
SPSSRepository operator – detects notification events indicating changes to the deployed models managed in the SPSS Collaboration and Deployment Services repository and retrieves the indicated Modeler file version for automated publish and preparation for use in your Streams applications

Data

In this lab you will be working with a dataset containing characteristics of a number of human cell samples extracted from patients who were believed to be at risk of developing cancer.

Analysis of the original data showed that many of the characteristics differed significantly between benign and malignant samples. A Support Vector Machine (SVM) model was developed that can use the values of these cell characteristics in samples from other patients to give an early indication of whether their samples might be benign or malignant.

The predictive analytic models are built using the IBM SPSS Modeler product. The models used in the Streams application have already been developed and are available for download here: SPSSModels.zip.

Exercise 1 uses a patient data set named “cell_samples.data”, located in the Exercise1 project’s data directory.

Exercise 2 will use a Beacon operator to generate sample data.

The example is based on a dataset that is publicly available from the UCI Machine Learning Repository (Asuncion and Newman, 2007). The dataset consists of several hundred human cell sample records, each of which contains the values of a set of cell characteristics. The fields in each record are:

Field Name	Description
ID	Patient Identifier
Clump	Clump thickness
UnifSize	Uniformity of cell size
UnifShape	Uniformity of cell shape
MargAdh	Marginal adhesion
SingEpiSize	Single epithelial cell size
BareNuc	Bare nuclei
BlandChrom	Bland chromatin
NormNucl	Normal nucleoli
Mit	Mitoses
Class	Benign or malignant
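
To make the record layout concrete, here is a minimal Python sketch that maps one line of data to the field names in the table above. It assumes the data file is comma-separated and that its columns follow the order of the table; the lab does not confirm either point. The `parse_record` function is a hypothetical helper for illustration, and the sample line is made up rather than taken from cell_samples.data.

```python
import csv
from typing import Dict, List

# Field names in the order given in the table above.
FIELD_NAMES: List[str] = [
    "ID", "Clump", "UnifSize", "UnifShape", "MargAdh", "SingEpiSize",
    "BareNuc", "BlandChrom", "NormNucl", "Mit", "Class",
]


def parse_record(line: str) -> Dict[str, str]:
    """Map one comma-separated line to a {field name: raw value} dict.

    Assumption: the file is comma-separated and its columns follow
    FIELD_NAMES. Values are kept as strings because the lab does not
    state the column types.
    """
    values = next(csv.reader([line.strip()]))
    if len(values) != len(FIELD_NAMES):
        raise ValueError(
            f"expected {len(FIELD_NAMES)} fields, got {len(values)}"
        )
    return dict(zip(FIELD_NAMES, values))


if __name__ == "__main__":
    # Made-up sample line, not taken from cell_samples.data.
    sample = "1000001,5,1,1,1,2,1,3,1,1,2"
    record = parse_record(sample)
    print(record["Clump"], record["Class"])
```

Keeping the values as strings avoids guessing at types; convert them only after checking the actual file.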

Downloads

SPSSModels.zip – Contains the SPSS model files used throughout the lab
SPSS_SPLProjects.zip – Contains the SPL Projects used in the exercises as well as the solution projects

Setup for the Lab

Ensure that the SPSS Modeler Solution Publisher product is installed (included in SPSS Collaboration and Deployment Services 5.0 or later)
- See http://www-01.ibm.com/software/analytics/spss/ for more information
Extract SPSSModels.zip into your home directory
Import SPSS_SPLProjects.zip into Streams Studio
In Streams Studio, use Streams Explorer to add the com.ibm.spss.streams.analytics toolkit, which comes packaged with SPSS Modeler Solution Publisher
In the instance you are going to submit your applications to, add the CLEMRUNTIME environment variable (it points to your SPSS Publisher installation):
```
streamtool setproperty --application-ev -i <instance-id> CLEMRUNTIME=/path/to/spss_publisher/install
```

Exercise 1 – Produce a prediction and confidence for each cell sample in a file

Problem Statement

In this exercise, use the SPSSScoring operator to calculate a prediction (benign=2 or malignant=4) and a confidence (0-100%) based on cell sample data read from a file. Start with the Exercise1 code. It already contains the schema definition of the incoming data, the FileSource to read that data, a Functor that simulates the prediction, and a FileSink to write out the predicted values. Your task is to replace the Functor with the SPSSScoring operator. There is a completed version in the Exercise1Solution project.

Outline

As a challenge, feel free to use this outline rather than the step-by-step instructions to build the application. If you get stuck, the completed exercise is in the Exercise1Solution project.

Create a new SPSSScoring operator.
Specify the pimfile, parfile and xmlfile parameters on the operator to point to the SPSS published model artifacts for model svm_cancer-goodrbf in the directory where you extracted SPSSModels.zip.
Specify the modelFields and streamAttributes parameters that provide the necessary mapping from the Streams tuple attributes to the model inputs. Hint: look in the SPSS published model XML file in the tags to find the parameter names and datatypes needed by the SPSS model.
Specify the output section to populate the prediction and confidence output tuple attributes from the model execution result. Hint: look in the SPSS published model XML file in the tags to find the parameter names and datatypes produced by the SPSS model.
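
For the two hints above, it can help to see the structure of the published XML file at a glance. The following Python sketch prints every element with its attributes and text, indented by nesting depth, so the input and output field definitions are easier to spot. It makes no assumption about the tag names used in the file; `describe_elements` is a hypothetical helper name, and the script expects the path to the XML file as its only argument.

```python
import sys
import xml.etree.ElementTree as ET
from typing import List


def describe_elements(root: ET.Element) -> List[str]:
    """Return one line per element: indentation, tag, attributes, text."""
    lines: List[str] = []

    def walk(element: ET.Element, depth: int) -> None:
        # Drop any "{namespace}" prefix so the tag is easier to read.
        tag = element.tag.split("}")[-1]
        attrs = " ".join(f'{k}="{v}"' for k, v in element.attrib.items())
        text = (element.text or "").strip()
        parts = [part for part in (tag, attrs, text) if part]
        lines.append("  " * depth + " ".join(parts))
        for child in element:
            walk(child, depth + 1)

    walk(root, 0)
    return lines


if __name__ == "__main__":
    if len(sys.argv) == 2:
        # Parse the file given on the command line and print its outline.
        tree_root = ET.parse(sys.argv[1]).getroot()
        print("\n".join(describe_elements(tree_root)))
    else:
        print("usage: python describe_xml.py /path/to/model.xml")
```

Run it against the svm_cancer-goodrbf XML file from SPSSModels.zip and scan the output for the field names and datatypes.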

Step By Step Instructions

Right-click on the Exercise1 project. Select “Configure SPL Build.” Open the “SPL Build” twistie and click on “Environment.” Add the CLEMRUNTIME and SPSS_TOOLKIT_INSTALL environment variables:

CLEMRUNTIME	/path/to/spss_publisher/install
SPSS_TOOLKIT_INSTALL	/path/to/com.ibm.spss.streams.analytics
Open the Exercise1 project twistie, and right-click on the Exercise1 main composite. Select “Open with Graphical Editor”.
In the Graphical Editor, search for the SPSSScoring operator by typing it in the text box above the palette on the right side.
Drag the SPSSScoring operator from the palette and drop it onto the Functor operator in the middle of the graph.
Note: This technique allows you to replace an operator with a different one. The editor will handle refactoring.
Select ‘Yes’ when prompted to override the selected operator.
Right-click on the SPSSScoring operator and select ‘Edit’.
In the Properties view that opens, click on the Param tab.

Update each of the parameters with the following values:

pimfile: "/home/streamsadmin/SPSS/Models/svm_cancer-goodrbf.pim" 
parfile: "/home/streamsadmin/SPSS/Models/svm_cancer-goodrbf.par" 
xmlfile: "/home/streamsadmin/SPSS/Models/svm_cancer-goodrbf.xml" 
modelFields: "ID","Clump","UnifSize","UnifShape","MargAdh","SingEpiSize","BareNuc","BlandChrom","NormNucl","Mit","Class" 
streamAttributes: patientId,clump,sizeUniformity,shapeUniformity,marginalAdhesion,singEpiSize,bareNucleoli,blandChromatin,normalNucleoli,mitoses,actualClass

Note: You can copy and paste the values above into the Studio parameters page. When finished, the values should match the ones below.

In the Properties view, click on the Output tab
Expand the tree items and update the ‘prediction’ and ‘confidence’ attributes with the following values:

prediction: fromModel("SClass")
confidence: fromModel("SPClass")

Note: You can copy and paste the values above into the Output tab. When finished, the values should match the ones below.
Save the changed file. The project will auto-compile for you. Fix any errors.
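
If the build reports errors, one thing worth checking is that the modelFields and streamAttributes lists line up. The Python sketch below pairs the two lists from the parameter values given earlier and fails if their lengths differ. It assumes the operator matches the lists by position, which is how the values in this exercise are arranged; `pair_fields` is a hypothetical helper, not part of the toolkit.

```python
from typing import Dict, List

# Values copied from the modelFields and streamAttributes parameters.
MODEL_FIELDS: List[str] = [
    "ID", "Clump", "UnifSize", "UnifShape", "MargAdh", "SingEpiSize",
    "BareNuc", "BlandChrom", "NormNucl", "Mit", "Class",
]
STREAM_ATTRIBUTES: List[str] = [
    "patientId", "clump", "sizeUniformity", "shapeUniformity",
    "marginalAdhesion", "singEpiSize", "bareNucleoli", "blandChromatin",
    "normalNucleoli", "mitoses", "actualClass",
]


def pair_fields(
    model_fields: List[str], stream_attributes: List[str]
) -> Dict[str, str]:
    """Pair each model field with the stream attribute at the same position.

    Assumption: the two parameter lists are matched by position.
    """
    if len(model_fields) != len(stream_attributes):
        raise ValueError(
            f"{len(model_fields)} model fields but "
            f"{len(stream_attributes)} stream attributes"
        )
    if len(set(model_fields)) != len(model_fields):
        raise ValueError("duplicate name in model fields")
    return dict(zip(model_fields, stream_attributes))


if __name__ == "__main__":
    # Print the pairing as a two-column list for a visual check.
    for field, attribute in pair_fields(MODEL_FIELDS, STREAM_ATTRIBUTES).items():
        print(f"{field:12} <- {attribute}")
```

A mismatch in length, or a pair that looks wrong in the printed list, points to a value that was dropped or reordered while editing the parameters.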

Run

Run the distributed build for the Exercise1 main composite.

The output will be written to the output.txt file in the project’s data directory and should look like: