Kubernetes with OpenShift World Tour: Get hands-on experience and build applications fast! Find a workshop!

Virtualizing Db2 Warehouse data with data virtualization

This tutorial is part of the Getting started with IBM Cloud Pak for Data learning path.

For decades, companies have tried to break down silos by copying data from different operational systems into central data stores for analysis, such as data marts, data warehouses and data lakes. This is costly and prone to error. Most struggle to manage an average of 33 unique data sources, which are diverse in structure and type, and are often trapped in data silos that are hard to find and access.

With data virtualization, you can query data across many systems without having to copy and replicate data, which reduces costs. It also can simplify your analytics and make them more up to date and accurate because you’re querying the latest data at its source.

In this tutorial, we’re going to learn how to virtualize Db2 Warehouse data with Data Virtualization on IBM Cloud Pak for Data to make queries across multiple data sources.

Learning objectives

In this tutorial you will learn how to:

  • Add data sets to IBM Cloud Pak for Data.
  • Add a data source to Data Virtualization.
  • Virtualize the data and create a joined view.
  • Assign virtualized data to a project.
  • Add roles to users and perform admin tasks.

Prerequisites

Estimated time

This tutorial will take approximately 30-45 minutes to complete.

Steps

Step 1: Get the data

Download the 3 data files:

Step 2: About the data set

The data set used for this tutorial is originally from Watson Analytics and was used on a Kaggle project, it contains information about customer churn for a Telecommunications company. The data is split into three CSV files.

billing.csv

This file has the following attributes:

  • Customer ID
  • Contract (Month-to-month, one year, two year)
  • Paperless Billing (Yes, No)
  • Payment Method (Bank transfer, Credit card, Electronic check, Mailed check)
  • Monthly Charges ($)
  • Total Charges ($)
  • Churn (Yes, No)

customer-service.csv

  • Customer ID
  • Gender (Male, Female)
  • Senior Citizen (1, 0)
  • Partner (Yes, No)
  • Dependents (Yes, No)
  • Tenure (1-100)

products.csv

  • Customer ID
  • Phone Service (Yes, No)
  • Multiple Lines (Yes, No, No phone service)
  • Internet Service (DSL, Fiber optic, No)
  • Online Security (Yes, No, No internet service)
  • Online Backup (Yes, No, No internet service)
  • Device Protection (Yes, No, No internet service)
  • Tech Support (Yes, No, No internet service)
  • Streaming TV (Yes, No, No internet service)
  • Streaming Movies (Yes, No, No internet service)

Step 3: Seeding our Db2 Warehouse Database

We’ll need a place to store our data. For this workshop, we’ve opted to use Db2 Warehouse on our local Cloud Pak for Data cluster. Note that Cloud Pak for Data can work with any database with a JDBC connector, so this is only one of many choices.

Load data into local DB2 Warehouse

These instructions are for loading the data into the local Cloud Pak for Data version of DB2 Warehouse. They will be similar for the IBM Cloud version.

Click the (☰) hamburger menu in the upper left corner, and choose Collect -> My data:

Choose Collect  My data

Go to the Databases tab, click on the 3 vertial lines on the DB2 Warehouse tile, and click Open:

Open Service DB2 Warehouse

Under Menu choose Load and Load Data:

Menu Load Data

Choose Browse files:

DB2 browse files

Navigate to where you cloned this repository, then to data/split/ and choose billing.csv, then click Next.

DB2 navigate to billing.csv

Choose Schema NULLIDRA and click + New table. Under “New Table Name” type “BILLING” and click Create, then Next.

DB2 choose schema and create table

Accept the defaults and click Next. Click Begin Load.

DB2 load final

Repeat for the products.csv file, naming the table PRODUCTS and the customer-service.csv file, naming the table CUSTOMERS.

Step 4: Creating a new IBM Cloud Pak for Data project

Log in to IBM Cloud Pak for Data

Launch a browser and navigate to your IBM Cloud Pak for Data deployment

Cloud Pak for Data login

Create a new project

Go the (☰) menu and click Projects

(☰) Menu -> Projects”/></p>
<p>Click on <em>New project</em></p>
<p><img class=

Create a new project, choose Analytics project, give it a unique name, and click *OK. Click Create on the next screen.

Pick a name

Step 5: Add a new data source connection

For IBM Cloud Pak for Data to read our Db2 Warehouse data we need to add a new Data Source to IBM Cloud Pak for Data. This requires inputting the usual JDBC details.

To get the connection info for you local DB2 Warehouse, go to the (☰) menu, and click on the My Instances option.

(☰) Menu  My Instances

In My instances, go to the Provisioned instances tab. Highlight your local DB2 Warehouse, click the 3 vertical dots on the far right, and then click View Details:

Provisioned local DB2 details

Either keep this window open in a separate tab, or copy the required Connection information: Host, Port, Database name, Username, and Password. You can get the port from the JDBC Connection URL, for example, for the URL jdbc:db2://os-workshop-nov22worker-05.vz-cpd-nov22.com:30290/BLUDB the port is the number after the colin in the URL 30290:

DB2 Connection credentials

Add the new data source

To add a new data source, go to the (☰) menu and click on the Connections option.

(☰) Menu -> Collections”/></p>
<p>At the overview, click <em>Add connection</em>.</p>
<p><img class=

Start by giving your new Connection a name, and select Db2 Warehouse on Cloud as your connection type. More fields should appear. Fill in the new fields with the same credentials for your local Db2 Warehouse connection from the previous section. Click Test Connection. After that succeeds with Success The test connection was successful., click Add to save the connection information.

Add a Db2 Warehouse on Cloud connection

The new connection will be listed in the overview.

Connection has been added!

Virtualize Db2 data with Data Virtualization

NOTE: This section requires Admin user access to the IBM Cloud Pak for Data cluster.

For this section we’ll now use the Data Virtualization tool to import the data from Db2 Warehouse, which is now exposed as an Connection in IBM Cloud Pak for Data.

Add a data source to Data Virtualization

To launch the data virtualization tool, go the (☰) menu and click Collect and then Data Virtualization.

(☰) Menu -> Collect -> Data Virtualization”/></p>
<p>At the empty overview, click <em>Add</em> and choose <em>Add data source</em>.</p>
<p><img class=

Select the data source we made in the previous step, and click Next.

Add the Db2 Warehouse connection

The new connection will be listed as a data source for data virtualization.

Db2 Warehouse connection is now associated with Data Virtualization

Start virtualizing data

In this section, because we now have access to the Db2 Warehouse data, we can virtualize the data to our IBM Cloud Pak for Data project. Click on the Menu button and choose Virtualize.

Menu -> Virtualize”/></p>
<p>Several tables will appear (many are created as sample data when a Db2 Warehouse instance is provisioned) in the table. Find the tables you created earlier, the instructions suggested naming them: <code style=CUSTOMER, PRODUCT and BILLING. Once selected click on Add to cart and then on View Cart.

Choose the tables to virtualize

The next panel prompts the user to choose which project to assign the data to, choose the project you created in the previous exercise. Click Virtualize to start the process.

Add virtualized data to your project

You’ll be notified that the virtual tables have been created! Let’s see the new virtualized data from the Data Virtualization tool by clicking View my data.

We've got virtualized data

Join the virtualized data

Now we’re going to join the tables we created so we have a merged set of data. It will be easier to do it here rather than in a notebook where we’d have to write code to handle three different data sets. Click on any two tables (PRODUCTS and BILLING for instance) and click the Join view button.

Choose to join two tables

To join the tables we need to pick a key that is common to both data sets. Here we choose to map customerID from the first table to customerID on the second table. Do this by clicking on one and dragging it to another. When the line is drawn click on Join.

Map the two customerID keys

In the next panel, we’ll give our joined data a unique name (to be consistent with SQL standards, pick an all uppercase name). I chose XXXBILLINGPRODUCTS (where XXX is my all uppercase user ID). Then, review the joined table to ensure that all columns are present and only one customerID column exists. Click Next to continue.

Review the proposed joined table

Next we choose which project to assign the joined view to, choose the project you created in the previous exercise. Click Create view to start the process.

Add joined data tables to your project

You’ll be notified that the join has succeeded! Click on View my virtualized data. to repeat this again so we have all three tables.

The data join succeeded!

IMPORTANT Repeat the same steps as above, but this time choose to join the new joined view (XXXBILLINGPRODUCTS) and the last virtualized table (CUSTOMERS), to create a new joined view that has all three tables, let’s call it XXXBILLINGPRODUCTSCUSTOMERS. Switching to our project should show all three virtualized tables, and two joined tables. Do not go to the next section until this step is performed.

Our data sets at the end of this section

Grant access to the virtualized data

For other users to have access to the data that you just virtualized, you need to grant them access. Follow these steps to make your virtualized data visible to them.

Go to Data Virtualization from the (☰) menu. Click on Menu -> My virtualized data.

Click on the virtualized data you’ve created, then click the 3 horizontal dots ... to the right of one, and choose Manage access:

Manage access to virtualized data

Click the Specific users button and click + grant access:

Grant Access to specific users

Select the users you wish to grant access to and click Add:

Select Users to Grant Access to

Assign the “Steward” role to the users.

Go to Data Virtualization option from the menu. Click on Manage users

Manage users in Data Virtualization

Click on Add user and ensure all users have the Steward role.

Manage users in Data Virtualization

Step 6: Users Assign virtualized data

Assign the data to your project

From the menu, click on Collections -> Data Virtualization. You’ll be brought to the My data section. Here you should see the data that the administrator has assigned to you. Choose the data sets available, and click Assign to start importing it to your project.

Select the data you want to import

From here, choose the project you previously created.

Assign the data to a project

Switching to our project should show the virtualized tables and the joined tables. Do not go to the next section until this step is performed.

Our data sets at the end of this section

Conclusion

This tutorial explained how to virtualize Db2 Warehouse data with data virtualization on IBM Cloud Pak for Data to make queries across multiple data sources. This tutorial is part of the Getting started with IBM Cloud Pak for Data learning path. To continue the series and learn more about IBM Cloud Pak for Data, take a look at the next tutorial, Data visualization with data refinery.

Scott D’Angelo
Steve Martinelli