IBM SPSS and InfoSphere Streams FAQ
This answers the common questions we get about using IBM InfoSphere Streams and SPSS together.
Q: Why use Streams and SPSS together?
IBM SPSS Modeler provides a state-of-the-art environment for understanding data and producing predictive models. InfoSphere Streams provides a scalable, high-performance environment for real-time analysis of data in motion, including traditional structured or semi-structured data and unstructured data types. Some applications need deep analytics derived from historical data to score streaming data in real time, at low latency and high volume. The SPSS Analytics Toolkit for InfoSphere Streams lets you integrate the predictive models designed and trained in IBM SPSS Modeler with your IBM InfoSphere Streams applications.
Q: What do I need to use them together?
- You need models built with the IBM SPSS Modeler product. The SPSS Modeler product is installed on a Windows workstation.
- You need InfoSphere Streams installed on the Linux machine(s) where Streams will be running.
- You need the Linux Solution Publisher component of the SPSS Collaboration and Deployment Services product installed on every node in the Streams cluster that will be used to score SPSS models in Streams.
- You need to configure the Streams application to reference the SPSS Analytics Toolkit for InfoSphere Streams, which is installed as part of the Solution Publisher install.
Q: Where can I find information on how to use the two together?
- The InfoSphere Streams Redbook, IBM InfoSphere Streams: Accelerating Deployments with Analytic Accelerators, describes visual development, visualization, adapters, analytics, and accelerators for IBM InfoSphere Streams. Chapter 15 covers the SPSS toolkit and describes the required steps, from building models to using published models in InfoSphere Streams, as well as the development process itself. The Redbook can be downloaded for free.
- For a step-by-step walkthrough of using the operators in the toolkit, complete the SPSS Analytics Toolkit lab.
- The SPSS Analytics Toolkit for InfoSphere Streams documentation has full details on installation, operators and example usage.
Q: What about PMML and the Mining Toolkit provided in the Streams product?
The Streams product includes a model scoring toolkit, the Mining Toolkit, that supports several mining models via PMML. Mining Toolkit scoring was built for use with PMML models produced by IBM InfoSphere Warehouse (see Overview of the Mining Toolkit). In theory, it would work with any “compatible” PMML model at the specified versions.
The following table shows the supported PMML versions for each supported algorithm.
Table 1. Supported PMML versions for each algorithm
|Algorithm|Supported PMML versions|
|---|---|
|Decision Trees|2.0 – 3.0|
|Naïve Bayes|2.0 – 3.2|
|Logistic Regression|2.0 – 3.2|
|Demographic Clustering|2.0 – 3.0|
|Kohonen Clustering|2.0 – 3.0|
|Linear Regression|2.0 – 3.0|
|Polynomial Regression|2.0 – 3.0|
|Transform Regression|2.0 – 3.0|
|Association Rules|2.0 – 3.2|
While SPSS Modeler can produce PMML, it does not produce it at these older versions, so PMML from SPSS Modeler cannot be used with the Mining Toolkit.
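To see whether a given PMML file falls inside the ranges in Table 1, you can read the `version` attribute on the root `PMML` element and compare it with the range for the algorithm. The following is a minimal Python sketch under stated assumptions: the function names and the sample document are made up for illustration, the algorithm is supplied by the caller rather than detected from the file, and passing this check says nothing about whether the Mining Toolkit would actually accept the model.

```python
import xml.etree.ElementTree as ET

# Version ranges copied from Table 1 (inclusive lower and upper bounds).
SUPPORTED_VERSIONS = {
    "Decision Trees": ("2.0", "3.0"),
    "Naive Bayes": ("2.0", "3.2"),
    "Logistic Regression": ("2.0", "3.2"),
    "Demographic Clustering": ("2.0", "3.0"),
    "Kohonen Clustering": ("2.0", "3.0"),
    "Linear Regression": ("2.0", "3.0"),
    "Polynomial Regression": ("2.0", "3.0"),
    "Transform Regression": ("2.0", "3.0"),
    "Association Rules": ("2.0", "3.2"),
}


def parse_version(text):
    """Turn a version string such as '3.2' into a comparable tuple (3, 2)."""
    return tuple(int(part) for part in text.split("."))


def pmml_version(pmml_text):
    """Return the version declared on the root PMML element, or None."""
    root = ET.fromstring(pmml_text)
    return root.get("version")


def is_supported(algorithm, version):
    """Check a PMML version against the Table 1 range for one algorithm."""
    low, high = SUPPORTED_VERSIONS[algorithm]
    return parse_version(low) <= parse_version(version) <= parse_version(high)


# Hypothetical, stripped-down PMML document used only to exercise the check.
SAMPLE_PMML = '<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1"><Header/></PMML>'

declared = pmml_version(SAMPLE_PMML)
print(declared, is_supported("Decision Trees", declared))
```

A file that declares a version above the upper bound for its algorithm fails the check, which matches the situation described above for PMML exported from SPSS Modeler.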
Q: So why use published SPSS Modeler streams instead of PMML models?
The short answer is more models, additional flexibility, and support for model deployment and management.
Using the SPSS Analytics Toolkit for InfoSphere Streams offers more power than simply exporting the model as PMML, because it allows you to publish and deploy complete IBM SPSS Modeler streams. That means you can perform data preparation as well as record and field operations, such as aggregating data, selecting records, or deriving new fields, before creating predictions based on a model. You can then further process the model results before saving the data, all simply by executing the published stream.
It supports all the model types available in the SPSS Modeler palette, and you can combine multiple models in a single published IBM SPSS Modeler stream.
In addition, the SPSS toolkit provides operators that interact with the SPSS model repository. Specifically, the following operators are provided:
- SPSSScoring – integrates with SPSS Modeler Solution Publisher to enable scoring of the predictive models you designed in SPSS Modeler from within InfoSphere Streams applications
- SPSSPublish – automates the SPSS Modeler Solution Publisher ‘publish’ function, which generates the executable images needed to refresh the model used in your InfoSphere Streams applications from the logical definition of an SPSS Modeler scoring branch defined in an SPSS Modeler file
- SPSSRepository – detects notification events indicating changes to the deployed models managed in the SPSS Collaboration and Deployment Services repository, and retrieves the indicated Modeler file version for automated publishing and preparation for use in your InfoSphere Streams applications
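The three operators above suggest a refresh flow: a repository change is detected, the changed Modeler file is published, and scoring switches to the refreshed model. The Python sketch below only illustrates that flow under stated assumptions. Every class and function in it (`ModelChange`, `publish`, `Scorer`), the file name, and the toy scoring rule are hypothetical stand-ins, not the toolkit's SPL operators or any IBM API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ModelChange:
    """Hypothetical notification that a new Modeler file version is available."""
    modeler_file: str
    version: int


def publish(change: ModelChange) -> Callable[[dict], dict]:
    """Stand-in for the publish step: produce something that can score records."""
    def score(record: dict) -> dict:
        # Toy rule so the example runs; a real model would be far richer.
        return {
            **record,
            "model_version": change.version,
            "flagged": record["amount"] > 100,
        }
    return score


class Scorer:
    """Stand-in for the scoring step: scores records with the current model."""

    def __init__(self) -> None:
        self._score: Optional[Callable[[dict], dict]] = None

    def refresh(self, score_fn: Callable[[dict], dict]) -> None:
        # Swap in the newly published model; later records use it.
        self._score = score_fn

    def process(self, record: dict) -> dict:
        if self._score is None:
            raise RuntimeError("no model has been published yet")
        return self._score(record)


scorer = Scorer()
scorer.refresh(publish(ModelChange("churn.str", 1)))  # hypothetical file name
first = scorer.process({"amount": 150})

# A later repository change triggers another publish and refresh.
scorer.refresh(publish(ModelChange("churn.str", 2)))
second = scorer.process({"amount": 50})
print(first, second)
```

The point of the sketch is the separation of concerns: detection, publishing, and scoring are distinct steps, so a model can be replaced without stopping the flow of records being scored.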
Q: What about the R support available in Streams?
Streams provides the R-project Toolkit, which contains an operator that facilitates integration between InfoSphere Streams and the R environment.
R is a language and environment for statistical computing and graphics. For example, it provides statistical techniques such as linear and nonlinear modeling, time-series analysis, clustering, and classification. For more information about R, see http://www.r-project.org.
The R-project toolkit contains the RScript operator, which maps input tuple attributes to objects that can be used in R commands. It then runs a script that contains R commands and maps the objects that are output from the script to output tuple attributes. The script you provide to the operator can use any appropriate R statements, including those that apply data mining algorithms. For more information, see the toolkit documentation.
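The attribute-to-object mapping described above can be pictured with a small Python analogy. This is a sketch under stated assumptions, not the RScript operator: Python's `exec` stands in for the R interpreter, and the function name, the mapping dictionaries, and the attribute names are all hypothetical.

```python
def run_script(input_tuple, input_map, script, output_map):
    """Run a script over one tuple, mapping attributes in and out.

    input_map:  {input attribute name: variable name seen by the script}
    output_map: {variable name set by the script: output attribute name}
    """
    # Bind each mapped input attribute to a variable the script can use.
    namespace = {var: input_tuple[attr] for attr, var in input_map.items()}
    # Python's exec stands in for handing the script to the R environment.
    exec(script, namespace)
    # Copy the script's result objects back into output tuple attributes.
    return {attr: namespace[var] for var, attr in output_map.items()}


result = run_script(
    input_tuple={"price": 20.0, "quantity": 6},
    input_map={"price": "price", "quantity": "qty"},
    script="total = price * qty",
    output_map={"total": "orderTotal"},
)
print(result)
```

Keeping the mappings separate from the script means the script works with its own variable names, while the surrounding application decides which tuple attributes feed it and which receive its results.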
Unlike the SPSS Analytics Toolkit for InfoSphere Streams, the R-project toolkit has no concept of a model repository and provides no operators for managing the deployment and lifecycle of models.