
Data visualization with data refinery

This tutorial is part of the Getting started with IBM Cloud Pak for Data learning path.

Data refinery is part of IBM Watson® and comes with IBM Watson Studio on the IBM Public Cloud and with IBM Watson Knowledge Catalog running on-premises on IBM Cloud Pak® for Data. It’s a self-service data-preparation client for data scientists, data engineers, and business analysts. With it, you can quickly transform large amounts of raw data into quality consumable information that’s ready for analytics. Data refinery makes it easy to explore, prepare, and deliver data that people across your organization can trust.

Learning objectives

After completing this tutorial, you will know how to:

  • Load data into the IBM Cloud Pak for Data platform for use with data refinery.
  • Transform a sample data set, either by entering command-line R code or selecting menu operations.
  • Use Data Flow steps to keep track of your work.
  • Visualize the data with charts and graphs.

Prerequisites

Estimated time

Completing this tutorial should take about 45 minutes.

Steps

Step 1. Load the billing.csv data into data refinery

  1. Download the billing.csv file.

  2. From the Project home, click on the Assets tab. Next, either drag and drop the downloaded billing.csv file onto the right-hand pane where it says "Drop files here or browse for files to upload", or click browse and choose the downloaded billing.csv file.

     Add the billing.csv data

  3. Click on the newly added billing.csv file.

  4. You should be able to see the data as shown below. Click on Refine.

     View uploaded data

  5. Data refinery should launch and open the data.

     Data Refinery view of the BILLING table

  6. Click the X by the Details button to close it.
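
For readers who want to see what this load step amounts to outside the tool, here is a minimal sketch in plain Python. It is not how data refinery itself works. The sample rows are made up for illustration and only stand in for billing.csv, and `load_rows` is a hypothetical helper name.

```python
import csv
import io

# Made-up rows standing in for billing.csv; the real file has more rows and columns.
SAMPLE_CSV = """CustomerID,MonthlyCharges,TotalCharges,Churn
0001-AAAA,29.85,29.85,No
0002-BBBB,56.95,1889.5,No
0003-CCCC,52.55,,No
"""

def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_rows(SAMPLE_CSV)
print(len(rows), "rows; columns:", list(rows[0].keys()))
```

Every value comes back as a string, including the numeric-looking ones, which is why a type conversion is usually one of the first refinement steps.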

Step 2. Refine your data

We’ll start out on the Data tab.

Transform your sample data set by entering R code in the command line or selecting operations from the menu. For example, type filter on the command line and observe that autocomplete will give hints on the syntax and how to use the command.

Command line filter

Alternatively, hover over an operation or function name to see a description and detailed information for completing the command. When you’re ready, click Apply to apply the operation to your data set.

Click the +Operation button.

Choose Operation button

Notice that TotalCharges is stored as a string. Since it represents a decimal number, let’s convert the values to decimal. Choose the Convert Column Type operation.

Choose Convert Column Type

Click + Select column, then pick Column -> TotalCharges and Type -> Decimal, then click Apply.

Convert to Decimal
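
As a rough illustration of what a string-to-decimal conversion involves, the following Python sketch converts a column and treats blank or unparseable values as missing. The helper name `to_decimal`, the sample rows, and the choice to map bad values to `None` are assumptions made for this example, not a description of the Convert Column Type operation's internals.

```python
from decimal import Decimal, InvalidOperation

def to_decimal(value):
    """Return a Decimal for a numeric string, or None if it is blank or unparseable."""
    text = value.strip()
    if not text:
        return None
    try:
        return Decimal(text)
    except InvalidOperation:
        return None

# Made-up sample rows; the second one has a blank TotalCharges value.
rows = [
    {"CustomerID": "0001-AAAA", "TotalCharges": "29.85"},
    {"CustomerID": "0003-CCCC", "TotalCharges": " "},
]
for row in rows:
    row["TotalCharges"] = to_decimal(row["TotalCharges"])

print(rows)
```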

Next, we want to make sure that there are no empty values. The TotalCharges column happens to have some, so let’s fix that. Click the Filter operation, choose the TotalCharges column from the drop-down and the operator Is empty, then click Apply.

Filter is empty

We can see that there are only three rows with an empty value for TotalCharges.

Filter is empty results
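
The logic of an "Is empty" filter can be sketched in a few lines of Python. The rows below are made up, and `is_empty` is a hypothetical helper that counts both missing values and blank strings as empty, which is an assumption about what "empty" should mean here.

```python
def is_empty(value):
    """Treat None and whitespace-only strings as empty."""
    return value is None or str(value).strip() == ""

# Made-up sample rows; two of them have no TotalCharges value.
rows = [
    {"CustomerID": "0001-AAAA", "TotalCharges": "29.85"},
    {"CustomerID": "0002-BBBB", "TotalCharges": ""},
    {"CustomerID": "0003-CCCC", "TotalCharges": "1889.5"},
    {"CustomerID": "0004-DDDD", "TotalCharges": None},
]

empty_rows = [row for row in rows if is_empty(row["TotalCharges"])]
print(len(empty_rows), "rows with an empty TotalCharges")
```

Inspecting the matching rows before deleting anything mirrors the tutorial's approach: look first, then decide whether dropping them is safe.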

It should be safe to just drop these rows from the data set, so let’s do that.

Remove the filter you just added. You can delete it using one of the following methods:

  • Hover over the corresponding step in the Steps section and the delete icon (trash can) will appear. Click on this icon to remove the filter.
  • Click the undo arrow at the top of the page.

Remove applied filter

Next, choose the operation Remove empty rows, select the TotalCharges column, click Next, then click Apply on the next screen.

Remove empty rows

Finally, we can remove the CustomerID column, since it won’t be useful for training a machine learning model in the next exercise. Choose the Remove operation, then Change column selection. Under Select a column, pick CustomerID, then click Next, then Apply.

Remove CustomerID column
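
The last two operations, removing rows with an empty TotalCharges and removing the CustomerID column, could be expressed roughly as follows in Python. The function names and sample rows are illustrative assumptions. Both functions return new lists, so the original rows stay untouched.

```python
def remove_empty_rows(rows, column):
    """Keep only rows whose value in `column` is neither None nor blank."""
    return [r for r in rows if r[column] is not None and str(r[column]).strip() != ""]

def drop_column(rows, column):
    """Return copies of the rows without `column`."""
    return [{k: v for k, v in r.items() if k != column} for r in rows]

# Made-up sample rows; the second one has an empty TotalCharges value.
rows = [
    {"CustomerID": "0001-AAAA", "MonthlyCharges": "29.85", "TotalCharges": "29.85", "Churn": "No"},
    {"CustomerID": "0002-BBBB", "MonthlyCharges": "52.55", "TotalCharges": "", "Churn": "No"},
    {"CustomerID": "0003-CCCC", "MonthlyCharges": "70.70", "TotalCharges": "151.65", "Churn": "Yes"},
]

cleaned = drop_column(remove_empty_rows(rows, "TotalCharges"), "CustomerID")
print(cleaned)
```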

Step 3. Use data flow steps to keep track of your work

What if we do something we don’t want? Data refinery keeps track of the steps and we can undo (or redo) an action using the circular arrows.

Undo recent action

As you refine your data, the IBM data refinery keeps track of the steps in your data flow. You can modify them and even select a step to return to a particular moment in your data’s transformation.

To see the steps in the data flow that you have performed, click the Steps button. The operations you have performed on the data will be shown.

Data flow steps

You can modify these steps in real time and save for future use.
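
One way to picture a data flow with undo and redo is as an ordered list of named steps that is replayed against the original data. The Python sketch below is only a mental model. The `DataFlow` class, its method names, and the sample rows are all hypothetical and say nothing about how data refinery is actually implemented.

```python
class DataFlow:
    """A replayable list of named transformation steps with undo and redo."""

    def __init__(self, rows):
        self._source = rows   # original data, never modified
        self._steps = []      # applied steps as (name, function) pairs
        self._undone = []     # steps removed by undo, available for redo

    def add_step(self, name, func):
        self._steps.append((name, func))
        self._undone.clear()  # a new step invalidates the redo history

    def undo(self):
        if self._steps:
            self._undone.append(self._steps.pop())

    def redo(self):
        if self._undone:
            self._steps.append(self._undone.pop())

    def step_names(self):
        return [name for name, _ in self._steps]

    def result(self):
        rows = self._source
        for _, func in self._steps:
            rows = func(rows)
        return rows


# Made-up sample rows.
rows = [
    {"CustomerID": "0001-AAAA", "TotalCharges": "29.85"},
    {"CustomerID": "0002-BBBB", "TotalCharges": ""},
]

flow = DataFlow(rows)
flow.add_step("Remove empty rows",
              lambda rs: [r for r in rs if r["TotalCharges"].strip()])
flow.add_step("Remove CustomerID",
              lambda rs: [{k: v for k, v in r.items() if k != "CustomerID"} for r in rs])

after_two_steps = flow.result()
flow.undo()                   # drop the most recent step
after_undo = flow.result()
flow.redo()                   # bring it back
print(flow.step_names())
```

Because the steps are replayed from the untouched source each time, removing or reordering a step never corrupts the data.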

Step 4. Profile the data

Clicking on the Profile tab will bring up a quick view of several histograms about the data.

Data Refinery Profile tab

You can get insights into the data from the histograms:

  • Twice as many customers are on a month-to-month contract as on a one-year or two-year contract.
  • More customers choose paperless billing, but around 40 percent still prefer to have a paper bill sent to them.
  • You can see the distribution of MonthlyCharges and TotalCharges.
  • From the Churn column, you can see that a significant number of customers will cancel their service.
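
The histograms on the Profile tab boil down to counting values per category or per numeric bin. Here is a minimal Python sketch of that idea. The column name `Contract`, the sample rows, and the bin width of 25 are assumptions chosen for illustration.

```python
from collections import Counter

def category_counts(rows, column):
    """Count how often each distinct value appears in a column."""
    return Counter(row[column] for row in rows)

def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of its bin."""
    bins = Counter()
    for v in values:
        bins[int(v // bin_width) * bin_width] += 1
    return dict(sorted(bins.items()))

# Made-up sample rows.
rows = [
    {"Contract": "Month-to-month", "MonthlyCharges": 29.85, "Churn": "No"},
    {"Contract": "One year", "MonthlyCharges": 56.95, "Churn": "No"},
    {"Contract": "Month-to-month", "MonthlyCharges": 52.55, "Churn": "Yes"},
    {"Contract": "Two year", "MonthlyCharges": 70.70, "Churn": "No"},
]

contract_counts = category_counts(rows, "Contract")
charge_bins = histogram([row["MonthlyCharges"] for row in rows], bin_width=25)
print(contract_counts)
print(charge_bins)
```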

Step 5. Visualize with charts and graphs

  1. Choose the Visualizations tab to bring up an option to choose which columns to visualize. Under Columns to Visualize, choose TotalCharges and click Visualize data.

     Visualize TotalCharges column

  2. We first see the data in a histogram by default. You can choose other chart types. We’ll pick Scatter plot next by clicking on it.

     Visualize TotalCharges histogram

  3. In the scatter plot, choose TotalCharges for the x-axis, MonthlyCharges for the y-axis, and Churn for the color map.

     Set x- and y-axes and Color map

Scroll down and give the scatter plot a title and subtitle if you wish. In the Actions panel, notice that you can perform tasks such as Start over, Download chart details, Display data label in chart, Download chart image, or set Global visualization preferences. (Hover over the icons to see their names.) Click on the “gear” icon in the Actions panel:

Visualize set titles and choose preferences

  4. The Global visualization preferences include settings for Titles, Tools, Color schemes, and Notifications. Click on the Theme tab and update the Color scheme to Vivid, then click the Apply button.

     Visualize set vivid

Now the colors for all of our charts will reflect this:

Visualize show vivid
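
To make the scatter plot's color map concrete: coloring by Churn means splitting the (TotalCharges, MonthlyCharges) points into one series per Churn value, with each series drawn in its own color. The Python sketch below shows that grouping step only and does not draw anything. The function name `scatter_series` and the sample rows are illustrative assumptions.

```python
from collections import defaultdict

def scatter_series(rows, x, y, color):
    """Group (x, y) points into one list per distinct value of the color column."""
    series = defaultdict(list)
    for row in rows:
        series[row[color]].append((row[x], row[y]))
    return dict(series)

# Made-up sample rows.
rows = [
    {"TotalCharges": 29.85, "MonthlyCharges": 29.85, "Churn": "No"},
    {"TotalCharges": 1889.5, "MonthlyCharges": 56.95, "Churn": "No"},
    {"TotalCharges": 151.65, "MonthlyCharges": 70.7, "Churn": "Yes"},
]

series = scatter_series(rows, x="TotalCharges", y="MonthlyCharges", color="Churn")
for churn_value, points in series.items():
    print(churn_value, points)
```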

Conclusion

This tutorial showed you a small sampling of the power of the IBM data refinery on IBM Cloud Pak for Data. It explained how you can transform data, either with R code at the command line or with menu operations on the columns, such as changing the data type, removing empty rows, or deleting columns altogether. It also explained that all the steps in your data flow are recorded, so you can remove steps, repeat them, or edit an individual step. It showed how you can quickly profile the data to see histograms and statistics for each column. Finally, it explained how you can create more in-depth visualizations, such as a scatter plot mapping TotalCharges vs. MonthlyCharges with the churn results highlighted in color.

This tutorial is part of the Getting started with IBM Cloud Pak for Data learning path. To continue the series and learn more about IBM Cloud Pak for Data, take a look at Find, prepare, and understand data with Watson Knowledge Catalog.