Whether you are new to IBM SPSS Modeler or a long-time user, it helps to be aware of all the modeling nodes available. Just as a carpenter needs a tool for every job, a data scientist needs an algorithm for every problem. I collected the description of each modeling node from the documentation and summarized it to provide a quick overview of the algorithms available natively in the software. The nodes below are grouped by the type of data mining task they perform (Classification, Association, and Segmentation), followed by the automated modeling nodes. The nodes in this list are available in IBM SPSS Modeler version 17.

Classification Model Nodes (1)

Name of Modeling Node Description
Decision Tree Nodes (C&R Tree, QUEST, CHAID, C5.0)
  • The algorithms are similar in that they can all construct a decision tree by recursively splitting the data into smaller and smaller subgroups
  • Each algorithm has important differences which should be taken into account during model building
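To make the idea of recursive splitting more concrete, here is a minimal Python sketch of a single split step. It assumes Gini impurity as the split criterion and a single numeric field; that is an assumption for illustration, not a description of how any of the four nodes is implemented, and the function names and toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best split 'value <= threshold'.

    Returns (None, impurity_of_all_labels) if no split improves on not splitting.
    """
    best = (None, gini(labels))
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical toy data: customer age and whether the customer responded.
ages = [22, 25, 31, 38, 45, 52]
responded = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_split(ages, responded)
print(threshold, impurity)
```

A tree builder would apply the same step again to each of the two resulting subgroups until a stopping rule is met.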
Decision List
  • Identifies subgroups or segments that show a higher or lower likelihood of a binary (yes or no) outcome relative to the overall sample
  • Allows complete control over the model, enabling you to edit segments, add your own business rules, specify how each segment is scored, and customize the model in a number of other ways to optimize the proportion of hits across all segments
Linear Models
  • Predict a continuous target based on linear relationships between the target and one or more predictors
  • Relatively simple, and produce an easily interpreted mathematical formula for scoring
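As a rough illustration of what an "easily interpreted formula" looks like, the sketch below fits a straight line to one predictor with ordinary least squares. It is a textbook calculation in plain Python, not the node's own estimation procedure, and the function name and numbers are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data that lies exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
intercept, slope = fit_line(xs, ys)
print(f"predicted = {intercept:.2f} + {slope:.2f} * x")
```

Scoring a new record is then just a matter of plugging its predictor value into the printed formula.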
Principal Component Analysis, Factor Analysis (PCA/Factor)
  • Provides powerful data-reduction techniques to reduce the complexity of your data
  • Two similar but distinct approaches are provided in one node
  • The goal is to find a small number of derived fields that effectively summarize the information in the original set of fields
Neural Network
  • Approximates a wide range of predictive models with minimal demands on model structure and assumptions
  • Relationships are determined during the learning process
  • The trade-off for this flexibility is that the neural network is not easily interpretable
Feature Selection
  • Narrows down the hundreds, or even thousands, of fields that could potentially be used as inputs for a data mining problem
  • Used to identify the fields that are most important for a given analysis

Classification Model Nodes (2)

Name of Modeling Node Description
Discriminant
  • Builds a predictive model for group membership
  • Model is composed of a discriminant function (or, for more than two groups, a set of discriminant functions) based on linear combinations of the predictor variables that provide the best discrimination between the groups
Logistic
  • Statistical technique for classifying records based on values of input fields
  • Both binomial models (for targets with two discrete categories) and multinomial models (for targets with more than two categories) are supported
Generalized Linear Model (GenLin)
  • Expands the general linear model so that the dependent variable is linearly related to the factors and covariates via a specified link function
  • Allows for the dependent variable to have a non-normal distribution
  • Covers widely used statistical models, such as linear regression for normally distributed responses, logistic models for binary data, loglinear models for count data, complementary log-log models for interval-censored survival data, plus many other statistical models through its very general model formulation
Generalized Linear Mixed Models (GLMM)
  • Extend the linear model so that:
    • The target is linearly related to the factors and covariates via a specified link function
    • The target can have a non-normal distribution
    • The observations can be correlated
  • Cover a wide variety of models, from simple linear regression to complex multilevel models for non-normal longitudinal data.
Cox
  • Builds a predictive model for time-to-event data
  • Produces a survival function that predicts the probability that the event of interest has occurred at a given time t for given values of the predictor variables
Support Vector Machine (SVM)
  • Enables you to use a support vector machine to classify data
  • Particularly suited for use with datasets with a large number of predictor fields
Bayesian Network
  • Enables you to build a probability model by combining observed and recorded evidence with “common-sense” real-world knowledge to establish the likelihood of occurrences by using seemingly unlinked attributes
  • Focuses on Tree Augmented Naïve Bayes (TAN) and Markov Blanket networks that are primarily used for classification
Self-Learning Response Model (SLRM)
  • Enables you to build a model that you can continually update, or re-estimate, as a dataset grows without having to rebuild the model every time using the complete dataset
  • For example, this is useful when you have several products and you want to identify which product a customer is most likely to buy if you offer it to them
  • Allows you to predict which offers are most appropriate for customers and the probability of the offers being accepted
K-Nearest Neighbor (KNN)
  • Method for classifying cases based on their similarity to other cases
  • Similar cases are near each other and dissimilar cases are distant from each other
  • The distance between two cases is a measure of their dissimilarity
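The nearest-neighbor idea described above can be sketched in a few lines of Python. This sketch assumes Euclidean distance and a simple majority vote among the k closest cases; both are assumptions for illustration, and the function names and toy data are hypothetical rather than taken from the KNN node.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two cases with numeric fields."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    """Classify 'query' by majority vote among its k nearest training cases.

    'train' is a list of (features, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training cases with two numeric fields each.
train = [
    ((1, 1), "low"), ((1, 2), "low"), ((2, 1), "low"),
    ((8, 8), "high"), ((8, 9), "high"), ((9, 8), "high"),
]
print(knn_predict(train, (2, 2)))
print(knn_predict(train, (7, 9)))
```

In practice the fields would usually be rescaled first so that no single field dominates the distance.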
Time Series to Model
  • Attempts to discover key causal relationships in time series data
  • Builds an autoregressive time series model for each target and includes only those inputs that have a causal relationship with the target
  • Differs from traditional time series modeling where you must explicitly specify the predictors for a target series
Spatio-Temporal Prediction (STP)
  • Uses data that contains location data, input fields for prediction (predictors), a time field, and a target field
  • Each location has numerous rows in the data that represent the values of each predictor at each time of measurement
  • Used to predict target values at any location within the shape data that is used in the analysis

Association Model Nodes

Name of Modeling Node Description
Apriori
  • Discovers association rules in the data
  • To create an Apriori rule set, you need one or more Input fields and one or more Target fields
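Association rules are usually ranked by measures such as support and confidence. The Python sketch below computes both for one rule using the common textbook definitions; it does not implement the Apriori search itself, and the measures reported by the node may be defined differently. The function names and baskets are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in 'items'."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"bread", "cheese"},
    {"bread", "cheese", "wine"},
    {"bread"},
    {"cheese", "wine"},
]
rule_support = support(baskets, {"bread", "cheese"})
rule_confidence = confidence(baskets, {"bread"}, {"cheese"})
print(rule_support, rule_confidence)
```

Here the rule bread -> cheese holds in half of all baskets and in two out of the three baskets that contain bread.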
CARMA
  • Uses an association rules discovery algorithm to discover association rules in the data
  • In contrast to Apriori, the CARMA node does not require Input or Target fields. This is integral to the way the algorithm works and is equivalent to building an Apriori model with all fields set to Both
Sequence
  • Discovers patterns in sequential or time-oriented data, in the format bread -> cheese
  • The elements of a sequence are item sets that constitute a single transaction
  • A sequence is a list of item sets that tend to occur in a predictable order
Association Rules
  • The Association Rules node extracts a set of rules from the data, pulling out the rules with the highest information content. It is very similar to the Apriori node, but there are some notable differences:
    • The Association Rules node cannot process transactional data
    • The Association Rules node can process data that has the List storage type and the Collection measurement level
    • The Association Rules node can be used with IBM® SPSS® Analytic Server. This provides scalability and means that you can process big data and take advantage of faster parallel processing
    • The Association Rules node provides additional settings, such as the ability to restrict the number of rules that are generated, thereby increasing the processing speed
    • Output from the model nugget is shown in the Output Viewer

Segmentation Model Nodes

Name of Modeling Node Description
K-Means
  • Provides a method of cluster analysis. It can be used to cluster the dataset into distinct groups when you don’t know what those groups are at the beginning
  • Instead of trying to predict an outcome, K-Means tries to uncover patterns in the set of input fields
  • Records are grouped so that records within a group or cluster tend to be similar to each other, but records in different groups are dissimilar
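The grouping behavior described above can be sketched with the classic iterative k-means procedure: assign each record to its nearest center, then move each center to the mean of its records. This is a minimal Python sketch with fixed starting centers and numeric fields only; the function names, starting centers, and data are assumptions, and the K-Means node's own options are not represented here.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two numeric records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centers, iterations=10):
    """Run a fixed number of assign-then-update rounds; returns (centers, clusters)."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each record joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its records.
        # An empty cluster keeps its previous center.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters

# Hypothetical records with two numeric fields and two starting centers.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(points, centers=[(1, 1), (8, 8)])
print(centers)
print(clusters)
```

No target field is involved anywhere: the groups emerge only from how close the records are to one another.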
Kohonen
  • A Kohonen network, also known as a knet or a self-organizing map, is a type of neural network that performs clustering
  • Used to cluster the dataset into distinct groups when you don’t know what those groups are at the beginning
TwoStep (TwoStep-AS is a similar node that is only available on IBM SPSS Analytic Server)
  • A two-step clustering method
  • The first step makes a single pass through the data, during which it compresses the raw input data into a manageable set of subclusters
  • The second step uses a hierarchical clustering method to progressively merge the subclusters into larger and larger clusters, without requiring another pass through the data
Anomaly
  • Used to identify outliers, or unusual cases, in the data
  • Anomaly detection models store information on what normal behavior looks like
  • Particularly useful in applications, such as fraud detection, where new patterns may constantly be emerging
  • Anomaly detection is an unsupervised method, which means that it does not require a training dataset containing known cases of fraud to use as a starting point

Automated Modeling Nodes

Name of Modeling Node Description
Automated Modeling Nodes (Auto Classifier, Auto Numeric, Auto Clustering)
  • The automated modeling nodes estimate and compare a number of different modeling methods, enabling you to try out a variety of approaches in a single modeling run
  • You can select the modeling algorithms to use, and the specific options for each, including combinations that would otherwise be mutually exclusive
  • For example, rather than choose between the quick, dynamic, or prune methods for a Neural Net, you can try them all
  • The node explores every possible combination of options, ranks each candidate model based on the measure you specify, and saves the best for use in scoring or further analysis
Time Series
  • Estimates exponential smoothing, univariate Autoregressive Integrated Moving Average (ARIMA), and multivariate ARIMA (or transfer function) models for time series
  • Produces forecasts based on the time series data
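To give a feel for the simplest of the model families mentioned for the Time Series node, here is a Python sketch of simple exponential smoothing, where each new observation is blended with the running level. The smoothing weight `alpha`, the function name, and the series are assumptions for illustration; the node's own model selection and parameter estimation are not shown.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the one-step-ahead forecast after smoothing the whole series.

    'alpha' (between 0 and 1) controls how much weight recent values get.
    """
    level = series[0]  # start the level at the first observation
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

# Hypothetical monthly sales figures.
sales = [10, 12, 14]
forecast = simple_exponential_smoothing(sales, alpha=0.5)
print(forecast)
```

A larger `alpha` makes the forecast react faster to recent changes, while a smaller one produces a smoother, slower-moving forecast.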

 

As you can see, IBM SPSS Modeler offers many algorithms that are well suited to building predictive models or to better understanding your data. If you are interested in more information on any of these modeling nodes, please see the documentation here, or post a question in the IBM SPSS Predictive Analytics Community!

1 comment on "IBM SPSS Modeler – Modeling Nodes"

  1. Felipe Rodriguez October 01, 2016

    Hi Greg: I have been trying to find how does clustering methods in SPSS modeler cluster variables that are categorical? Do you have a text source that specifies what all clustering algorithms do with this type of variables?
