Virtualizing Db2 Warehouse data with data virtualization
Use data virtualization on IBM Cloud Pak for Data to make queries across multiple data sources
This tutorial is part of the Getting started with IBM Cloud Pak for Data learning path.
|100||Introduction to IBM Cloud Pak for Data||Article|
|101||Virtualizing Db2 Warehouse data with data virtualization||Tutorial|
|201||Data visualization with data refinery||Tutorial|
|301||Data analysis, model building, and deploying with Watson Machine Learning with notebook||Pattern|
|401||Monitoring the model with Watson OpenScale||Pattern|
For decades, companies have tried to break down silos by copying data from different operational systems into central data stores for analysis, such as data marts, data warehouses and data lakes. This is costly and prone to error. Most struggle to manage an average of 33 unique data sources, which are diverse in structure and type, and are often trapped in data silos that are hard to find and access.
With data virtualization, you can query data across many systems without having to copy and replicate data, which reduces costs. It also can simplify your analytics and make them more up to date and accurate because you’re querying the latest data at its source.
In this tutorial, we’re going to learn how to virtualize Db2 Warehouse data with Data Virtualization on IBM Cloud Pak for Data to make queries across multiple data sources.
In this tutorial you will learn how to:
- Add data sets to IBM Cloud Pak for Data.
- Add a data source to Data Virtualization.
- Virtualize the data and create a joined view.
- Assign virtualized data to a project.
- Add roles to users and perform admin tasks.
This tutorial will take approximately 30-45 minutes to complete.
Step 1: Get the data
Download the 3 data files:
Step 2: About the data set
The data set used for this tutorial is originally from Watson Analytics and was used on a Kaggle project, it contains information about customer churn for a Telecommunications company. The data is split into three CSV files.
This file has the following attributes:
- Customer ID
- Contract (Month-to-month, one year, two year)
- Paperless Billing (Yes, No)
- Payment Method (Bank transfer, Credit card, Electronic check, Mailed check)
- Monthly Charges ($)
- Total Charges ($)
- Churn (Yes, No)
- Customer ID
- Gender (Male, Female)
- Senior Citizen (1, 0)
- Partner (Yes, No)
- Dependents (Yes, No)
- Tenure (1-100)
- Customer ID
- Phone Service (Yes, No)
- Multiple Lines (Yes, No, No phone service)
- Internet Service (DSL, Fiber optic, No)
- Online Security (Yes, No, No internet service)
- Online Backup (Yes, No, No internet service)
- Device Protection (Yes, No, No internet service)
- Tech Support (Yes, No, No internet service)
- Streaming TV (Yes, No, No internet service)
- Streaming Movies (Yes, No, No internet service)
Step 3: Seeding our Db2 Warehouse Database
We’ll need a place to store our data. For this workshop, we’ve opted to use Db2 Warehouse on our local Cloud Pak for Data cluster. Note that Cloud Pak for Data can work with any database with a JDBC connector, so this is only one of many choices.
Load data into local DB2 Warehouse
These instructions are for loading the data into the local Cloud Pak for Data version of DB2 Warehouse. They will be similar for the IBM Cloud version.
Click the (☰) hamburger menu in the upper left corner, and choose
Go to the Databases tab, click on the 3 vertial lines on the DB2 Warehouse tile, and click
Navigate to where you cloned this repository, then to
data/split/ and choose
billing.csv, then click
NULLIDRA and click
+ New table. Under “New Table Name” type “BILLING” and click
Accept the defaults and click
Repeat for the
products.csv file, naming the table
PRODUCTS and the
customer-service.csv file, naming the table
Step 4: Creating a new IBM Cloud Pak for Data project
Log in to IBM Cloud Pak for Data
Launch a browser and navigate to your IBM Cloud Pak for Data deployment
Create a new project
Go the (☰) menu and click Projects
Create a new project, choose
Analytics project, give it a unique name, and click *OK. Click
Create on the next screen.
Step 5: Add a new data source connection
For IBM Cloud Pak for Data to read our Db2 Warehouse data we need to add a new Data Source to IBM Cloud Pak for Data. This requires inputting the usual JDBC details.
To get the connection info for you local DB2 Warehouse, go to the (☰) menu, and click on the My Instances option.
In My instances, go to the Provisioned instances tab. Highlight your local DB2 Warehouse, click the 3 vertical dots on the far right, and then click
Either keep this window open in a separate tab, or copy the required Connection information: Host, Port, Database name, Username, and Password. You can get the port from the JDBC Connection URL, for example, for the URL
jdbc:db2://os-workshop-nov22worker-05.vz-cpd-nov22.com:30290/BLUDB the port is the number after the colin in the URL
Add the new data source
To add a new data source, go to the (☰) menu and click on the Connections option.
Start by giving your new Connection a name, and select Db2 Warehouse on Cloud as your connection type. More fields should appear. Fill in the new fields with the same credentials for your local Db2 Warehouse connection from the previous section. Click
Test Connection. After that succeeds with
Success The test connection was successful., click
Add to save the connection information.
The new connection will be listed in the overview.
Virtualize Db2 data with Data Virtualization
NOTE: This section requires
Adminuser access to the IBM Cloud Pak for Data cluster.
For this section we’ll now use the Data Virtualization tool to import the data from Db2 Warehouse, which is now exposed as an Connection in IBM Cloud Pak for Data.
Add a data source to Data Virtualization
To launch the data virtualization tool, go the (☰) menu and click
Collect and then
Select the data source we made in the previous step, and click Next.
The new connection will be listed as a data source for data virtualization.
Start virtualizing data
In this section, because we now have access to the Db2 Warehouse data, we can virtualize the data to our IBM Cloud Pak for Data project. Click on the Menu button and choose Virtualize.
BILLING. Once selected click on Add to cart and then on View Cart.
The next panel prompts the user to choose which project to assign the data to, choose the project you created in the previous exercise. Click Virtualize to start the process.
You’ll be notified that the virtual tables have been created! Let’s see the new virtualized data from the Data Virtualization tool by clicking View my data.
Join the virtualized data
Now we’re going to join the tables we created so we have a merged set of data. It will be easier to do it here rather than in a notebook where we’d have to write code to handle three different data sets. Click on any two tables (
BILLING for instance) and click the Join view button.
To join the tables we need to pick a key that is common to both data sets. Here we choose to map
customerID from the first table to
customerID on the second table. Do this by clicking on one and dragging it to another. When the line is drawn click on Join.
In the next panel, we’ll give our joined data a unique name (to be consistent with SQL standards, pick an all uppercase name). I chose
XXX is my all uppercase user ID). Then, review the joined table to ensure that all columns are present and only one
customerID column exists. Click Next to continue.
Next we choose which project to assign the joined view to, choose the project you created in the previous exercise. Click Create view to start the process.
You’ll be notified that the join has succeeded! Click on View my virtualized data. to repeat this again so we have all three tables.
IMPORTANT Repeat the same steps as above, but this time choose to join the new joined view (
XXXBILLINGPRODUCTS) and the last virtualized table (
CUSTOMERS), to create a new joined view that has all three tables, let’s call it
XXXBILLINGPRODUCTSCUSTOMERS. Switching to our project should show all three virtualized tables, and two joined tables. Do not go to the next section until this step is performed.
Grant access to the virtualized data
For other users to have access to the data that you just virtualized, you need to grant them access. Follow these steps to make your virtualized data visible to them.
Go to Data Virtualization from the (☰) menu. Click on
My virtualized data.
Click on the virtualized data you’ve created, then click the 3 horizontal dots
... to the right of one, and choose
Specific users button and click
+ grant access:
Select the users you wish to grant access to and click
Assign the “Steward” role to the users.
Go to Data Virtualization option from the menu. Click on Manage users
Click on Add user and ensure all users have the Steward role.
Step 6: Users Assign virtualized data
Assign the data to your project
From the menu, click on Collections -> Data Virtualization. You’ll be brought to the My data section. Here you should see the data that the administrator has assigned to you. Choose the data sets available, and click Assign to start importing it to your project.
From here, choose the project you previously created.
Switching to our project should show the virtualized tables and the joined tables. Do not go to the next section until this step is performed.
This tutorial explained how to virtualize Db2 Warehouse data with data virtualization on IBM Cloud Pak for Data to make queries across multiple data sources. This tutorial is part of the Getting started with IBM Cloud Pak for Data learning path. To continue the series and learn more about IBM Cloud Pak for Data, take a look at the next tutorial, Data visualization with data refinery.