This tutorial is part of the Getting started with Watson Studio learning path.

Introduction

The purpose of this tutorial is to demonstrate features within IBM® Watson™ Studio that help you visualize and gain insights into your data, then cleanse and transform your data to build high-quality predictive models.

Prerequisites

To complete the tutorials in this learning path, you will need an IBM Cloud account. You can obtain a free trial account, which gives you access to IBM Cloud, IBM Watson Studio, and the IBM Watson Machine Learning Service.

Estimated time

It should take you approximately 30 minutes to complete this tutorial.

Steps

Set up your environment

The following steps are required to complete all of the tutorials in this learning path.

Create IBM Cloud Object Storage service

An Object Storage service is required to create projects in Watson Studio. If you do not already have a storage service provisioned, complete the following steps:

  1. From your IBM Cloud account, search for “object storage” in the IBM Cloud Catalog. Then, click the Object Storage tile.

    object-storage-tile

  2. Enter a name and select the Standard (free) version of the service.

    object-storage-create

  3. For the Resource Group, you can use the default value, but a better choice is to use a dedicated group that you have created in IBM Cloud. You can create new resource groups in IBM Cloud by using the Manage > Account menu option and then navigating to Account resources > Resource groups in the toolbar on the left. The Create button is in the upper-right corner of the page.

  4. Click Create.

Create Watson Studio project

If you don’t already have an existing project to use for this learning path, create a new one.

  1. Sign in to Watson Studio using the account that you created for your IBM Cloud account.

  2. Click either Create a project or New project.

  3. Select Create an empty project.

    create-empty-project

  4. In the New project window, name the project (for example, “Watson Machine Learning”).

    create-project

  5. For Storage, you should select the IBM Cloud Object Storage service you created in the previous step. If it is the only storage service that you have provisioned, it is assigned automatically.

  6. Click Create.

Provision IBM Cloud services

NOTE: This section discusses creating new services for your project. If you have previously provisioned any of these services, you can choose to use them instead of creating new ones.

Watson Machine Learning service

To provision the Machine Learning service and associate it with the current project:

  1. Select the Settings tab for the project.

  2. Scroll down to the Associated services section.

    add-ml-service

  3. Click Add Service.

  4. Select Watson from the drop-down menu.

  5. On the next page, click Add in the Machine Learning service tile.

  6. On the next page, select the New tab to create a new service.

  7. Keep the Lite plan for now (you can change it later, if necessary).

  8. Scroll down and click Create to create the service.

  9. The Confirm Creation window opens, which lets you specify the details of the service such as the region, the plan, the resource group, and the service name.

    confirm-ml-service

  10. Enter a name for the service instance (optionally, you can prefix the generated name with “watson-machine-learning”).

  11. For the Resource group, you can use the default value, but a better choice is to use a dedicated group that you have created in IBM Cloud. You can create new resource groups in IBM Cloud by using the Manage > Account menu option and then navigating to Account Resources > Resource Groups in the toolbar on the left. The Create button is in the upper-right corner of the page.

  12. Click Confirm.

IBM Cognos Dashboard Embedded service

To provision the IBM Cognos Dashboard Embedded service and associate it with the current project:

  1. Select the Settings tab for the project.

  2. Scroll to the Associated services section.

  3. Click Add service.

  4. Select Dashboard from the drop-down menu.

    add-dashboard-service

  5. On the next page, select New to create a new service.

  6. Keep the Lite plan for now (you can change it later, if necessary).

  7. Click Create to create the service.

    The Confirm Creation window appears, which lets you specify the details of the service such as the region, the plan, the resource group, and the service name.

    confirm-dashboard-service

  8. Enter a name for the service instance (optionally, you can prefix the generated name so that it is easy to identify as the dashboard service).

  9. For the Resource group, select the same resource group used with the provisioning of your other IBM Cloud services.

  10. Click Confirm.

Upload data set

Next, you’ll download the data set from Kaggle and upload it to Watson Studio.

  1. Navigate to the URL for the data set on Kaggle (https://www.kaggle.com/sandipdatta/customer-churn-analysis), and download the file to your local desktop.

  2. Rename the file to something more meaningful (for example, ‘customer-churn-kaggle.csv’).

  3. In Watson Studio, select Assets.

  4. If not already open, click the 1001 data icon at the upper right of the panel to open the Files sub-panel. Then, click Load.

    upload-data-set

  5. Drag the file to the drop area to upload the data into Watson Studio.

  6. Wait until the file has been uploaded.
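
If you want to sanity-check the file locally before or after uploading it, you can do so with a few lines of Python. The following is a minimal, optional sketch; it assumes the file was renamed to customer-churn-kaggle.csv and sits in your current working directory.

```python
import pandas as pd

# Load the renamed Kaggle file (the file name comes from the renaming step above).
df = pd.read_csv("customer-churn-kaggle.csv")

# Quick sanity check: number of rows/columns, column names, and the first few rows.
print(df.shape)
print(df.columns.tolist())
print(df.head())
```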

Background

After completing the steps to set up your environment, you can now focus on the main topic of this tutorial, which is all about data. You'll learn how to visualize it, then prepare and transform it so that it can be used to build optimized, high-quality predictive models.

A classical data science approach to perform these activities is to use the Python programming language running in a Jupyter Notebook. While we cover this method later in the learning path tutorial Build models using Jupyter Notebooks in IBM Watson Studio, this tutorial focuses on alternative ways to achieve the same goal, using features and tools provided by Watson Studio, with no programming required.

Basic visualization in Watson Studio

After data is collected, the next step is referred to as the data understanding phase. This consists of activities that enable you to become familiar with the data, identify data quality problems, and discover first insights into the data.

You can achieve this in Watson Studio through simple user interactions, without writing a single line of code. To view the data set in Watson Studio, locate the data asset and then click the name of the data set to open it.

select-data-set

Watson Studio shows you a preview of the data in the Preview tab.

data-preview

Alternatively, the Profile tab gives you profiling information that shows the distribution of the values. For numerical features, it also shows the maximum, minimum, mean, and standard deviation for the feature:

data-profile

Notice that although the numerical columns are identified as type varchar, the profiler is smart enough to recognize them as numerical columns, convert them implicitly, and compute the mean and the standard deviation.
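
For comparison, the same summary statistics can be computed programmatically in a notebook with pandas. The following is a rough, optional equivalent of what the Profile tab shows; the explicit conversion mirrors the implicit varchar-to-numeric handling described above.

```python
import pandas as pd

df = pd.read_csv("customer-churn-kaggle.csv")

# Value distributions plus min/max/mean/std for numeric features,
# similar to what the Profile tab computes.
print(df.describe(include="all"))

# If a numeric column was read as a string (varchar), convert it explicitly.
df["total day minutes"] = pd.to_numeric(df["total day minutes"], errors="coerce")
print(df["total day minutes"].agg(["min", "max", "mean", "std"]))
```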

To generate the profile the first time:

  1. Select the Profile tab.

  2. Invoke the Create Profile command.

  3. Wait a short while and then refresh the page.

Notice that the churn column does not show a balanced distribution of churn and no-churn observations. This imbalance might mean that you should adopt stratified sampling or cross-validation strategies during the model building and evaluation phase.

churn-values
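
If you want to quantify this imbalance, or see why a stratified split helps, the following pandas/scikit-learn sketch illustrates the idea. It is optional; only the churn column name is taken from this data set, everything else is illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("customer-churn-kaggle.csv")

# Relative frequency of churn vs. no-churn; an unbalanced distribution shows up here.
print(df["churn"].value_counts(normalize=True))

# Stratifying on the target keeps the churn/no-churn ratio the same in the
# training and test sets, which matters when the classes are unbalanced.
train, test = train_test_split(df, test_size=0.2, stratify=df["churn"], random_state=42)
print(train["churn"].value_counts(normalize=True))
print(test["churn"].value_counts(normalize=True))
```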

More visualizations using the Cognos Dashboard service

You can look further into the data set by creating a dashboard with associated visualizations. This requires three steps: creating an empty dashboard, adding a data source to be used for visualizations, and adding appropriate visualizations to the dashboard.

To create the dashboard:

  1. Click Add to project.

  2. Click Dashboard to create a new dashboard.

  3. Follow these steps in the New Dashboard page:

    1. Enter a Name for the dashboard (for example, ‘customer-churn-dashboard’).

    2. Provide a Description for the dashboard (optional).

    3. For Cognos Dashboard Embedded Service, select the dashboard service that you created previously.

      create-dashboard

    4. Click Save.

  4. On the next page, select the Freeform template.

    free-form-diagram

  5. Keep the default setting that creates a Tabbed dashboard.

  6. Click OK to create an empty freeform dashboard with a single Tab.

To add a data connection:

  1. Click the Add a source button (the + icon) in the upper-left part of the page:

    select-source

  2. Click Select to select the customer churn data source.

  3. Back in the dashboard, select the newly imported data source.

  4. Preview the data source by clicking the table icon on the lower-right of the panel.

    show-churn-data

  5. Expand the data source by clicking > so that you can view the columns.

    data-source-columns

Notice that you can view and change the properties of the columns. Simply click the 3 dots to the right of the column name, then select Properties in the pop-up menu. This displays a window as shown above, and allows you to alter the default setting for Usage (Identifier, Attribute, and Measure) and Aggregate Function (Count, Count Distinct, Maximum, and Minimum). For now, you should be fine with the default settings.

To create a visualization that shows the distribution of churns and no-churns as a pie chart:

  1. Select the Visualizations icon in the toolbar to the left.

  2. Select a Pie chart.

  3. This creates a form for specifying the properties of the pie chart using, for example, columns of the data set.

    create-visualization

  4. Select the Sources icon in the toolbar to the left (located above the Visualizations icon).

  5. Drag the churn column onto the Segments property of the pie chart.

  6. Drag the churn column onto the Size property of the pie chart.

    visualization-props

  7. Click the Collapse arrow in the upper right of the form, as shown above. This minimizes the pie chart and renders it on the dashboard.

  8. Select the Tab at the upper left, then click the Edit the title button.

    initial-dashboard

  9. Provide a title for the tab (for example, ‘Customer Churn’).

Follow these steps and create two more visualizations:

  • A stacked column chart showing State (visualization property Bars) and Churn (Length, Color) on the X and Y axes, respectively

  • A pie chart showing the distribution of International Plan (Segments, Length)

This should result in a dashboard similar to the following image. Notice that you can move visualizations on the dashboard using the Move widget command located on the top of each visualization.

final-dashboard
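
For comparison, the first of these visualizations could be reproduced in a notebook with a few lines of pandas and matplotlib. This is only an illustrative sketch and is not part of the dashboard service.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customer-churn-kaggle.csv")

# Pie chart of churn vs. no-churn, similar to the first dashboard visualization.
df["churn"].value_counts().plot.pie(autopct="%.1f%%", ylabel="")
plt.title("Customer Churn")
plt.show()
```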

Dashboards are dynamic by nature and support exploration of the data using filters. In the visualization that shows ‘International Plan,’ click the slice associated with the value ‘yes.’ This creates a filter that applies to all other (connected) visualizations on the current dashboard.

filtered-dashboard

Notice that the slice for churn in the visualization to the left has increased significantly. This tells you that clients on an international plan are more likely to churn than clients that are not on an international plan. To remove the filter, click the filter icon for the visualization in the upper-right corner, then select the delete filter button that pops up (the icon is a cross in a circle). Clicking the slice again achieves the same effect.
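
You can verify this observation numerically as well. The following optional pandas sketch compares churn rates for customers with and without an international plan; the column spellings are assumed from this data set and might differ slightly in your copy.

```python
import pandas as pd

df = pd.read_csv("customer-churn-kaggle.csv")

# Churn rate broken down by whether the customer has an international plan.
# A noticeably higher rate for "yes" matches what the filtered dashboard shows.
print(pd.crosstab(df["international plan"], df["churn"], normalize="index"))
```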

Data preparation and transformation using Data Refinery

The data preparation phase covers all activities needed to construct the final data set that is fed into the machine learning service. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleansing of data for the modeling tools. This can involve turning categorical features into numerical ones, normalizing the features, and removing columns not relevant for prediction (for example, the phone number of the client).
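
For reference, the kinds of transformations just described look roughly like the following in pandas and scikit-learn. This is only a sketch of typical preparation steps, not something you need to run for this tutorial, and the categorical column names are assumptions based on this data set.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("customer-churn-kaggle.csv")

# Remove columns that are not relevant for prediction, such as the phone number.
df = df.drop(columns=["phone number"])

# Normalize the numeric features so that they are on a comparable scale.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

# Turn categorical features into numerical ones (one-hot encoding).
df = pd.get_dummies(df, columns=["state", "international plan", "voice mail plan"])
```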

If you would just like to create a model semi-automatically or fully automatically using the IBM Watson AutoAI and Watson Machine Learning services, no further activity is needed during data preparation (for the current data set) because the AutoAI service takes care of these operations in the background. We show how this is done in the Automate model building in IBM Watson Studio tutorial of this learning path.

Alternatively, Watson Studio offers a tool called Data Refinery that lets you clean up and transform data without any programming. To run it:

  1. Click Add to project in the top bar of the project overview page.

  2. In the Choose asset type window, select Data Refinery Flow to create a new flow.

  3. On the next page, select the Customer Churn data set and click Add.

  4. The data source opens so that you can view and transform it.

Note that you can also start Data Refinery by clicking Refine from the Preview panel of the data set.

start-refine

Data Refinery then loads and displays the data in the following table.

refine-data-set

Notice the tabs at the top left, which let you view the data in tabular form, profile it (as in the previous section), and create custom visualizations of the data.

To transform the data:

  1. Select the 3 dots in the “phone number” column and invoke the Remove command in the pull-down menu. This deletes the column.

    remove-phone-num

  2. Select the total day minutes feature column. This is really a String type but should be numeric.

  3. Click the Operation button in the upper-left corner, which shows you some available transformations.

    transform-operation

You could convert the column to another type (say, float or integer). However, we will not do this for now because the Machine Learning service does it for us automatically behind the scenes. In principle, though, you could turn the “total day minutes” column into an integer column rounded to zero decimals, or convert it into a floating-point column. For now, let’s continue by executing the flow we just defined and viewing the result.

  1. Click the Run Data Refinery flow button in the toolbar. Its icon is an arrow.

  2. Select the option to Save and create a job.

    save-and-create-job

  3. On the next page, you can name the flow and give it an optional description. Note that the output file will be named the same as the asset name, but with an added “shaped” suffix.

  4. Click Create and run.

The resulting window shows the input file, the output file, and the runs. Notice that there is also a tab where you can schedule the flow so that it is executed automatically.

refine-job-status

Go back to your project and check that the output file and the flow are now part of your project assets.

new-refine-flow-asset

If you click on the newly created flow asset, you see that the “phone number” column has been removed.

Data Refinery flows let you perform quick transformations of data without the need for programming. They are by no means a replacement for Jupyter Notebooks and the powerful capabilities of NumPy and pandas, but for a quick clean-up process they come in handy. For more complex transformations and computations, you should use other options such as Jupyter Notebooks or SPSS Modeler flows (which are covered in other tutorials in this learning path).
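
To make that comparison concrete, the flow you just built (dropping the phone number column and, optionally, converting total day minutes to a numeric type) could be expressed in a notebook roughly as follows. This is a sketch for comparison only; the file and column names come from the earlier steps.

```python
import pandas as pd

df = pd.read_csv("customer-churn-kaggle.csv")

# Step 1 of the flow: remove the "phone number" column.
df = df.drop(columns=["phone number"])

# Optional step: convert "total day minutes" to a numeric (float) type.
df["total day minutes"] = pd.to_numeric(df["total day minutes"], errors="coerce")

# Write the result, mirroring the "shaped" output file that the flow produces.
df.to_csv("customer-churn-kaggle_shaped.csv", index=False)
```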

Conclusion

This tutorial covered some of the tools available in Watson Studio for visualizing, preparing, and transforming your data.

Topics included previewing and profiling your data assets, building a Cognos dashboard with more visualizations, and using the Data Refinery flow tool to perform data transformations.

The remaining tutorials in this learning path discuss alternative ways to accomplish these tasks, as well as taking the next step: using the data to build and deploy predictive models. The next tutorial demonstrates the IBM Watson Studio AutoAI experiment tool, which is a non-programmatic approach to creating, evaluating, deploying, and testing machine learning models.