IBM SPSS Modeler provides predictive analytics to help you uncover data patterns, gain predictive accuracy and improve decision making. This tutorial demonstrates an end-to-end flow of how to use SPSS Modeler on Watson Studio, ingest the data in a Db2 Warehouse database, perform analytics, and store back the results as a new table in the same database.
Learning objectives
This tutorial will show you how to:
- Add the DB2 Warehouse Connection in Watson Studio.
- Create a new SPSS Modeler stream or add an existing SPSS Modeler.
- Run the SPSS Modeler and store output on DB2 Warehouse.
Prerequisites
- IBM Cloud Account
- Object Storage Service Instance from the IBM Cloud catalog
- Watson Studio Service Instance from the IBM Cloud catalog
- DB2 Warehouse Service Instance from the IBM Cloud catalog
- For this tutorial, we will be using Titanic Dataset from Kaggle which can be found here. Once you download the dataset, unzip the file onto your local file system.
Estimated time
Completing this tutorial should take about 30 minutes.
Steps
Step 1: Load sample data in Db2 Warehouse
- Open your IBM Cloud Dashboard and under
Cloud Foundry Services
open your createdDB2 Warehouse instance
. - Click on
Open Console
in the page opened, as shown below.
- Once the service opens up, from the side menu, click on the
load
option.
- Select the
train.csv
file from the unzipped data folder from Kaggle.
- Once you load select your appropriate Schema, usually specified as DASH followed by a series of numbers. Click on the
New Table
option as shown.
- Name it
TITANIC_DATA
or any other name of your choice. And click onBegin Load
.
Step 2: Add the DB2 Warehouse connection in Watson Studio
- Open your Watson Studio from IBM Cloud Dashboard and navigate to the created project or create a new
Modeler
Project and make sure you link your Cloud Object Storage instance to the Project.
- Click on the
Add to Project
button and selectConnection
.
- Select your
DB2 Warehouse
instance created on IBM Cloud.
- The details would already be filled and click the
Create
button.
Step 3: Create the SPSS Modeler and insert data from Db2 Warehouse
- Click on the
Add to Project
button and selectModeler
.
- Enter the a name for the modeler and make sure the below options are selected, and click on
Create
.
- On the Right side menu under the
Import
tab drag and drop theData Assets
node.
- Click on the
Change Data Asset > Connections > DB2Warehouse
and select the your schema or default schema (starting with DASH) and the uploaded table. Next, click onSave
.
Step 4: Visualize, analyze and perform feature selection with SPSS Modeler
- Given a dataset, SPSS Modeler provides many visualisation tools to understand the data. In this section, the tutorial will show how to create these Visualisations and get insights from the data.
Visualize
- In the options of the
Data Asset
node, click on thePreview
option and go toVisualizations
tab.
- Histograms are used to show the distribution of the data, in this example, plot a graph by taking
Age
in the x-axis and split-bysex
.
From this graph we can see that the ages of the passengers, follows a normal distribution, i.e, most people have an age range of 20-55, and there are fewer people who are less than 20 and greater than 50. We can also see that, for some age groups, there are more men than women.
- A bar diagram makes it easy to compare sets of data between different groups at a glance. Here we see the comparison between the number of male survivors and the number of female survivors. Taking
sex
as the category, Summary value ascount
and the value option asSurvived
.
- A pie chart is best used when trying to work out the composition of something. In this case we give the category as
Cabin
.
From this pie chart, we can see that 80% of the Cabin
column has NaNs that is missing values. So, we can make a conclusion to drop this column since we know it cannot affect the target, in our case the Survived column.
Feature selection
- Before using Feature Selection, we need to prepare the dataset. First we use
Filler
to fill in all missing values with null and NaN values. Next, we filter out theCabin
column from the Visualization stage usingFilter
node. And, finally we assign the correctType
.
- In the
Type
node adjust theSurvived
column’sRole
property asTarget
.
- Use the Feature Selection under the
Modeling
tab and run the modeler.
- A model node will be generated, while connecting to a
Table
node in theOutput
tab, you can see it filters out the unimportant columns.
Note: Nodes such as Derive and Merge can be used to create new columns from existing columns and merge two dataframes.
Step 5: Save data back to Db2 Warehouse
- Add the
Data Asset Export
node fromExport
tab. - Follow the steps from 1. Load Sample Data in Db2 warehouse to connect your Db2 Warehouse instance and select any table.
- Before saving the changes for the node. Enter a name for the output table, make sure the name is unique to the table names within the schema.
- Run the modeler flow and the desired output will be saved in your
Db2 Warehouse
.
Summary
In this tutorial, you learned how to:
- Load data from DB2 Warehouse onto SPSS Modeler.
- Work with SPSS nodes to perform the initial Visualizations, pre-processing and Feature Engineering of a given dataset.
- Store back data directly to Db2 Warehouse.