Data Virtualization on IBM Cloud Pak for Data

This tutorial is part of the Getting started with IBM Cloud Pak for Data learning path.

For decades, companies have tried to break down silos by copying data from different operational systems into central data stores for analysis, such as data marts, data warehouses, and data lakes. This is often costly and error prone. Most companies struggle to manage an average of 33 unique data sources, which are diverse in structure and type and are often trapped in data silos that are hard to find and access.

With data virtualization, you can query data across many systems without having to copy and replicate data, which helps reduce costs. It also can simplify your analytics and make them more up to date and accurate because you’re querying the latest data at its source.

In this tutorial, we’re going to learn how to use data virtualization on IBM Cloud Pak® for Data to make queries across multiple data sources like Netezza® Performance Server and Db2® Warehouse.

Learning objectives

In this tutorial, you will learn how to:

  • Add datasets to IBM Cloud Pak for Data.
  • Add a data source for data virtualization.
  • Virtualize the data and create a joined view.
  • Assign virtualized data to a project.
  • Add roles to users and perform admin tasks.

Prerequisites

Estimated time

This tutorial will take approximately 30-45 minutes to complete.

Step 1. Get the data

Download the three data files:

  1. billing.csv
  2. customer-service.csv
  3. products.csv

Step 2. About the dataset

The dataset used for this tutorial is originally from Watson Analytics®. It was used on a Kaggle project, and it contains information about customer churn for a telecommunications company. The data is split into three CSV files.

billing.csv

This file has the following attributes:

  • Customer ID
  • Contract (Month-to-month, one year, two year)
  • Paperless Billing (Yes, No)
  • Payment Method (Bank transfer, Credit card, Electronic check, Mailed check)
  • Monthly Charges ($)
  • Total Charges ($)
  • Churn (Yes, No)

customer-service.csv

  • Customer ID
  • Gender (Male, Female)
  • Senior Citizen (1, 0)
  • Partner (Yes, No)
  • Dependents (Yes, No)
  • Tenure (1-100)

products.csv

  • Customer ID
  • Phone Service (Yes, No)
  • Multiple Lines (Yes, No, No phone service)
  • Internet Service (DSL, Fiber optic, No)
  • Online Security (Yes, No, No internet service)
  • Online Backup (Yes, No, No internet service)
  • Device Protection (Yes, No, No internet service)
  • Tech Support (Yes, No, No internet service)
  • Streaming TV (Yes, No, No internet service)
  • Streaming Movies (Yes, No, No internet service)
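All three files share a customer identifier, which is what later lets data virtualization join them into a single view. Here is a minimal pandas sketch of that relationship, using tiny inline stand-ins for the real files (pandas is assumed to be available, and the key column name may differ in your copies of the data):

```python
import pandas as pd

# Tiny stand-ins for billing.csv, customer-service.csv, and products.csv;
# in practice you would use pd.read_csv("billing.csv"), etc.
billing = pd.DataFrame({"customerID": ["0001", "0002"], "Churn": ["Yes", "No"]})
customers = pd.DataFrame({"customerID": ["0001", "0002"], "tenure": [5, 40]})
products = pd.DataFrame({"customerID": ["0001", "0002"], "PhoneService": ["Yes", "No"]})

# Joining on the shared key yields one row per customer with all attributes.
merged = billing.merge(customers, on="customerID").merge(products, on="customerID")
print(merged.columns.tolist())
```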

Step 3. Set up the project and provision data virtualization on IBM Cloud Pak for Data

Log in to IBM Cloud Pak for Data

Launch a browser and navigate to your IBM Cloud Pak for Data deployment.

IBM Cloud Pak for Data login

Create a new IBM Cloud Pak for Data project

  1. Go to the hamburger (☰) menu and click Projects. Projects

  2. Click on New project. Start a new project

  3. Select Create an empty project. Create empty project

  4. Provide a name and optional description for the project and click Create. Pick a name

Provision data virtualization on IBM Cloud Pak for Data

Go to the Services tab. Under Data sources, choose the Data Virtualization tile. Click the three vertical dots and choose Deploy.

Provision data virtualization

Follow the instructions to deploy data virtualization.

For deployment using Managed OpenShift®, you must do the following:

  1. IMPORTANT: Decide whether to select the Update the kernel semaphore parameter checkbox.
  2. IMPORTANT: Do NOT choose the defaults for storage. You must choose ibmc-file-gold-gid as the storage class.

Step 4. Set up the database

IBM Cloud Pak for Data can work with any database that has a JDBC connector. For this tutorial, we will be using IBM Db2 Warehouse on IBM Cloud, a local IBM Db2 Warehouse, and Netezza® Performance Server.

Set up Netezza Performance Server

Before you create a connection to IBM Netezza Performance Server (NPS), you must create the required tables and load the CSV data into the NPS server using the nzload CLI. To install the nzload CLI, follow the instructions.

Log in to your IBM NPS console and create three tables for billing, customer, and product. Note that the tables must exist before you load the data using nzload. Then use the nzload command to load the CSV data into your NPS database, using the billing, customer, and product files downloaded above:

```bash
nzload -u <user> -pw <password> -host <host> -db <database> -t <table name> -delim ',' -df <csv file name>
```

If the nzload CLI is not supported on your platform (for example, macOS), you will have to create INSERT statements for the CSV data and run them from the Netezza console. This may take a little longer than the nzload command.
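Where nzload is unavailable, the INSERT statements can be generated mechanically from each CSV file. A rough sketch follows (a hypothetical helper, not part of the Netezza tooling; it quotes every value as a string and only escapes single quotes, so adapt it for numeric columns):

```python
import csv
import io

def csv_to_inserts(table, csv_text):
    # Read the CSV: the first row is the header, the rest are data rows.
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    cols = ", ".join(header)
    stmts = []
    for row in data:
        # Naive quoting: treat everything as a string, double any single quotes.
        vals = ", ".join("'" + v.replace("'", "''") + "'" for v in row)
        stmts.append(f"INSERT INTO {table} ({cols}) VALUES ({vals});")
    return stmts

sample = "customerID,Churn\n0001,Yes\n0002,No\n"
for stmt in csv_to_inserts("BILLING", sample):
    print(stmt)
```

Run the generated statements from the Netezza console after creating the tables.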

Set up the Db2 Warehouse on IBM Cloud

We’ll need a place to store our data. Using Db2 Warehouse on IBM Cloud is suggested, to conserve resources on the cluster. If you wish to use the local Db2 on the cluster instead, skip this section and set up the local Db2 warehouse on IBM Cloud Pak for Data.

NOTE: Ensure that you have provisioned Db2 warehouse on IBM Cloud before proceeding.

  1. Get connection details for Db2 Warehouse on IBM Cloud.

  2. Go to Service Credentials and click New credential +. Click the Copy to clipboard icon and save the credentials for later. Get IBM Cloud Db2 credentials

  3. Now go to Manage and click Open Console. Open IBM Cloud Db2 console

Get SSL certificate for Db2 Warehouse on IBM Cloud

You will need an SSL cert for IBM Cloud Pak for Data to use the IBM Cloud Db2 Warehouse instance.

In the Db2 Warehouse console, from the upper-left hamburger (☰) menu, click CONNECTION INFO > Connection Information, then Download SSL Certificate.

Download SSL certificate

You will need to convert the SSL certificate from .crt to a .pem file using OpenSSL. Run the following command:

```bash
openssl x509 -in DigiCertGlobalRootCA.crt -out DigiCertGlobalRootCA.pem -outform PEM -inform DER
```

Seed the Db2 Warehouse on IBM Cloud

  1. From the upper-left hamburger (☰) menu, click LOAD > Load data. Load data into IBM Cloud Db2

  2. Click browse files and select the billing.csv file that was downloaded earlier, then click Next. Add billing.csv file to IBM Cloud Db2

  3. Choose Schema NULLIDRA and click + New table. Under Create a new Table, provide BILLING as the name of the table and click Create, then Next. Create BILLING table in IBM Cloud Db2

  4. Accept the defaults and click Next. On the next screen, click Begin Load. Accept defaults in IBM Cloud Db2

Repeat for the products.csv file (naming the table PRODUCTS) and the customer-service.csv file (naming the table CUSTOMERS).

Now that the Db2 warehouse has been set up on IBM Cloud, you can skip ahead to add a new data source connection.


Step 5. Set up the local Db2 Warehouse on IBM Cloud Pak for Data

These instructions are for loading the data into the local IBM Cloud Pak for Data version of Db2 Warehouse. If you’ve used the IBM Cloud instance of Db2 Warehouse, you can skip to the next section.

It is suggested to use Db2 Warehouse on IBM Cloud to conserve resources on the CPD cluster as described in set up the Db2 warehouse on IBM Cloud. If you wish to use the local Db2 on the cluster, continue with this section.

Note that IBM Cloud Pak for Data can work with any database with a JDBC connector, so Db2 warehouse is only one of many choices.

NOTE: Ensure that you have provisioned the local Db2 warehouse on IBM Cloud Pak for Data before proceeding.

IMPORTANT: At least one OpenShift node should have the tag for db2wh (such as icp4data=database-db2wh); otherwise, you will not be able to provision the local Db2 warehouse. The tag can be set using the following command:

```bash
oc label node <NODE> icp4data=database-db2wh
```

Get connection details for local Db2 warehouse

To get the connection info for your local Db2 Warehouse, go to the hamburger (☰) menu and click on the My Instances option.

Menu  My Instances

Now go to the Provisioned instances tab and click on the row for your local Db2 Warehouse. This will open the details of the Db2 warehouse instance.

Provisioned local Db2 details

Either keep this window open in a separate tab or copy the required connection information: Host, Port, Database name, Username, and Password. You can get the port from the JDBC Connection URL; for example, for the URL jdbc:db2://10.221.243.172:31956/BLUDB, the port is the number after the colon in the URL: 31956.

Db2 Connection credentials
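If you prefer to extract the port (and the other pieces) programmatically rather than by eye, the JDBC URL can be split with a small regular expression. A sketch follows (a hypothetical helper that matches only the simple jdbc:db2://host:port/database shape shown above):

```python
import re

def parse_jdbc_db2(url):
    # Matches jdbc:db2://<host>:<port>/<database>
    m = re.match(r"jdbc:db2://([^:/]+):(\d+)/(\w+)", url)
    if not m:
        raise ValueError(f"unrecognized JDBC URL: {url}")
    return m.group(1), int(m.group(2)), m.group(3)

host, port, database = parse_jdbc_db2("jdbc:db2://10.221.243.172:31956/BLUDB")
print(host, port, database)
```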

Seed the local Db2 warehouse on IBM Cloud Pak for Data

  1. Click the hamburger (☰) menu in the upper-left corner and choose Collect > My data. Choose Collect My data

  2. Go to the Databases tab, click on the three vertical lines on the Db2 Warehouse tile and click Open database. Open Service Db2 Warehouse

  3. Under Summary, choose Load > Load Data. Menu Load Data

  4. Choose Browse files and select the billing.csv file downloaded earlier, then click Next. Db2 browse files

  5. Choose Schema NULLIDRA and click + New table. Under New Table Name, type BILLING, then click Create > Next. Create BILLING table in local Db2 warehouse

  6. Accept the defaults and click Next. On the next screen, click Begin Load. Accept defaults in local Db2 warehouse

Repeat for the products.csv file (naming the table PRODUCTS) and the customer-service.csv file (naming the table CUSTOMERS).

Step 6. Add a new data source connection

IBM Cloud Pak for Data can work with many databases, such as IBM Db2 Warehouse on IBM Cloud, a local IBM Db2 Warehouse, Netezza Performance Server, or MongoDB. In the following steps, you set up a connection to one of them and add it as a data source.

Get the connection information

If you are using Db2 Warehouse on IBM Cloud or Netezza Performance Server, ensure that you have obtained the JDBC connection details and SSL certificate using the instructions provided as part of Step 4 above.

If you are using the local Db2 warehouse on IBM Cloud Pak for Data, ensure that you have obtained the JDBC connection details using the instructions provided as part of Step 5 above.

Add the new data source

To add a new data source, go to the hamburger (☰) menu and click on the Connections option.

Collections

At the overview, click New connection +.

Overview page

Start by giving your new connection a name and selecting the connection type: choose Db2 Warehouse on Cloud for an IBM Db2 database, or PureData System for Analytics for Netezza Performance Server.

More fields should appear. Fill in the new fields with the credentials for your Db2 Warehouse connection (local or cloud). Click the checkbox for Use SSL.

NOTE: If you are using Db2 warehouse on IBM Cloud, click Select file and navigate to where you converted the SSL certificate for Db2 Warehouse from a .crt file to a .pem file (probably called DigiCertGlobalRootCA.pem). Click Test Connection, and after that succeeds, click Create.

Add a Db2 Warehouse on Cloud connection

The new connection will be listed in the overview.

Connection added
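Behind a form like this one, the individual fields are typically assembled into a single connection string. Here is a hedged sketch in the style of the ibm_db Python driver's DSN format (placeholder values; the string is only built here, not used to connect):

```python
def build_dsn(host, port, database, user, password, ssl_cert=None):
    # Key=value pairs in the style of an ibm_db / Db2 CLI connection string.
    parts = [
        f"DATABASE={database}",
        f"HOSTNAME={host}",
        f"PORT={port}",
        "PROTOCOL=TCPIP",
        f"UID={user}",
        f"PWD={password}",
    ]
    if ssl_cert:
        # e.g. the DigiCertGlobalRootCA.pem file converted earlier
        parts += ["SECURITY=SSL", f"SSLServerCertificate={ssl_cert}"]
    return ";".join(parts) + ";"

dsn = build_dsn("10.221.243.172", 31956, "BLUDB", "user", "secret",
                ssl_cert="DigiCertGlobalRootCA.pem")
print(dsn)
```

With the real driver, a connection would then be opened with something like `ibm_db.connect(dsn, "", "")`.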

Step 7. Virtualize data with data virtualization

NOTE: This section requires Admin user access to the IBM Cloud Pak for Data cluster.

For this section, we’ll use the data virtualization tool to import data from a database such as IBM Db2 Warehouse, Netezza Performance Server, or MongoDB that is exposed as a connection in IBM Cloud Pak for Data.

  1. To launch the data virtualization tool, go to the hamburger (☰) menu and click Collect, and then Data Virtualization. Collect Data Virtualization

  2. At the empty overview, click the drop-down next to Add new data source and select From existing connections. No data sources

  3. Select the data source we made in the previous step and click Next. The data source can be IBM Db2, Netezza (PureData System), MongoDB, or any other connection from the list. Add the Db2 Warehouse connection

  4. The new connection will be listed as a data source for data virtualization. Db2 Warehouse connection is now associated with data virtualization

Start virtualizing data

Now that we have access to the data in a database such as IBM Db2 Warehouse or Netezza Performance Server, we can virtualize it for our IBM Cloud Pak for Data project. Click the Data Sources drop-down and choose Virtualize.

Virtualize

Several tables appear in the list (many are sample tables created when a Db2 Warehouse instance is provisioned). Find the tables you created earlier; the previous instructions suggested naming them CUSTOMERS, PRODUCTS, and BILLING. Select them, then click Add to cart and then View Cart.

Choose the tables to virtualize

The next panel prompts the user to choose which project to assign the data to. Choose My virtualized data and uncheck the box that says Submit to catalog. Click Virtualize to start the process.

Add virtualized data to your project

You’ll be notified that the virtual tables have been created. Let’s see the new virtualized data from the data virtualization tool by clicking View my virtualized data.

We've got virtualized data

Join the virtualized data

Now we are going to join the tables we created so we have a merged set of data. It will be easier to do it here rather than in a notebook where we would have to write code to handle three different datasets. Click on any two tables (PRODUCTS and BILLING, for instance), then click the Join button.

Join two tables

To join the tables, we need to pick a key that is common to both datasets. Here we choose to map customerID from the first table to customerID on the second table. Do this by clicking one and dragging it to the other. When the line is drawn, click Next.

Map the two customer ID keys

Next, you have a chance to edit column names, but we will keep them as-is. Click Next.

Edit column names in joined table

In the next panel, we’ll give our joined data a unique name such as BILLINGPRODUCTS (to be consistent with SQL standards, pick an uppercase name). Under Assign to, choose My virtualized data and uncheck the box that says Submit to catalog. Click Create view to start the process.

Add joined data tables to your project

You’ll be notified that the join has succeeded. Click on View my virtualized data to go back and see all your virtualized data.

The data join succeeded

IMPORTANT: Now join the new joined view (BILLINGPRODUCTS) with the remaining virtualized table (CUSTOMERS) to create a joined view that spans all three tables; let’s call it BILLINGPRODUCTSCUSTOMERS. Switching back to the My virtualized data screen should show all three virtualized tables and two joined views. Do not go to the next section until this step is complete.

Our datasets at the end of this section
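The payoff of the merged view is that downstream queries touch one table instead of three. An illustrative pandas sketch of such a query, against a stand-in for the joined view (invented sample rows, not data from the tutorial files):

```python
import pandas as pd

# Stand-in for the BILLINGPRODUCTSCUSTOMERS joined view.
view = pd.DataFrame({
    "customerID": ["0001", "0002", "0003", "0004"],
    "InternetService": ["DSL", "DSL", "Fiber optic", "No"],
    "Churn": ["Yes", "No", "Yes", "No"],
})

# With one view, a question spanning products and billing is a single query,
# e.g. the churn rate per internet service type.
churn_rate = (view["Churn"] == "Yes").groupby(view["InternetService"]).mean()
print(churn_rate.to_dict())
```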

Grant access to the virtualized data

For other users to have access to the data you just virtualized, you need to grant it. Follow these steps to make your virtualized data visible to them:

  1. Go to Data Virtualization from the hamburger (☰) menu. Click on Menu > My virtualized data.

  2. Click on the virtualized data you’ve created, then click the three vertical dots to the right, and choose Manage access. Manage access to virtualized data

  3. Click the Specific users radio button, then Add user +. Grant Access to specific users

  4. Select the users you wish to grant access to and click Add users. Select users to grant access to

Repeat the above steps for the remaining tables and views.

Assign the Engineer role to the users

IBM Cloud Pak for Data users that need to use data virtualization functions must be assigned specific roles based on their job descriptions. These roles are Admin, Engineer, User, and Steward. You can learn more about these roles on the IBM Cloud Pak for Data product hub.

Let’s assign the Engineer role to some users:

  1. From the hamburger (☰) menu, choose the Data Virtualization option, then click My virtualized data > User management. Manage users in data virtualization

  2. Click on Add users + and update the role of your users to Engineer. Update roles in data virtualization

Step 8. Assign the virtualized data

Now let’s look at how a user who has access to virtualized data can assign the data to their project — that is, how to add the virtualized data as an asset to a project.

Assign the data to your project

  1. From the hamburger (☰) menu, click on Collect > Data Virtualization. You will be brought to the My virtualized data section. Here you should see the data you can access (or that the administrator has assigned to you). Select the checkbox next to our original tables (BILLING, PRODUCTS, CUSTOMERS) and the joined tables (BILLINGPRODUCTS, BILLINGPRODUCTSCUSTOMERS), and click the Assign button to import them into your project. Select data to import

  2. On the Assign virtual objects screen, choose the project to assign the data to. If there is a Submit to catalog checkbox at the top right, uncheck it, then click the Assign button to add the data to your project. Assign the data to a project

  3. In the pop-up panel, you will receive a confirmation that the objects have been assigned to your project. Click the Go to project button. Objects assigned to project

  4. Alternatively, close the modal and go to your projects by clicking the hamburger (☰) menu, then choosing Projects.

On the project page, clicking on the Assets tab will show the virtualized tables and joined tables that are now in your project.

Our datasets at the end of this section

Summary

This tutorial explained how to virtualize Db2 Warehouse data with data virtualization on IBM Cloud Pak for Data to make queries across multiple data sources. This tutorial is part of the Getting started with IBM Cloud Pak for Data learning path. To continue the series and learn more about IBM Cloud Pak for Data, take a look at the next tutorial, Data visualization with Data Refinery.