This tutorial is part of the Getting started with IBM Cloud Pak for Data learning path.
Level | Topic | Type |
---|---|---|
100 | Introduction to IBM Cloud Pak for Data | Article |
101 | Data Virtualization on IBM Cloud Pak for Data | Tutorial |
201 | Data visualization with Data Refinery | Tutorial |
202 | Find, prepare, and understand data with Watson Knowledge Catalog | Tutorial |
301A | Data analysis, model building, and deploying with Watson Machine Learning with notebook | Pattern |
301B | Automate model building with AutoAI | Tutorial |
301C | Build a predictive machine learning model quickly and easily with IBM SPSS Modeler | Tutorial |
401 | Monitor the model with Watson OpenScale | Pattern |
For decades, companies have tried to break down data silos by copying data from different operational systems into central data stores for analysis, such as data marts, data warehouses, and data lakes. This is often costly and error prone. Most companies struggle to manage an average of 33 unique data sources, which are diverse in structure and type, and the data is often trapped in silos that are hard to find and access.
With data virtualization, you can query data across many systems without having to copy and replicate data, which helps reduce costs. It also can simplify your analytics and make them more up to date and accurate because you’re querying the latest data at its source.
In this tutorial, we’re going to learn how to use data virtualization on IBM Cloud Pak® for Data to make queries across multiple data sources like Netezza® Performance Server and Db2® Warehouse.
Learning objectives
In this tutorial, you will learn how to:
- Add datasets to IBM Cloud Pak for Data.
- Add a data source for data virtualization.
- Virtualize the data and create a joined view.
- Assign virtualized data to a project.
- Add roles to users and perform admin tasks.
Prerequisites
Estimated time
This tutorial will take approximately 30-45 minutes to complete.
Step 1. Get the data
Download the three data files: billing.csv, customer-service.csv, and products.csv.
Step 2. About the dataset
The dataset used for this tutorial is originally from Watson Analytics®. It was used on a Kaggle project, and it contains information about customer churn for a telecommunications company. The data is split into three CSV files.
billing.csv
This file has the following attributes:
- Customer ID
- Contract (Month-to-month, one year, two year)
- Paperless Billing (Yes, No)
- Payment Method (Bank transfer, Credit card, Electronic check, Mailed check)
- Monthly Charges ($)
- Total Charges ($)
- Churn (Yes, No)
customer-service.csv
This file has the following attributes:
- Customer ID
- Gender (Male, Female)
- Senior Citizen (1, 0)
- Partner (Yes, No)
- Dependents (Yes, No)
- Tenure (1-100)
products.csv
This file has the following attributes:
- Customer ID
- Phone Service (Yes, No)
- Multiple Lines (Yes, No, No phone service)
- Internet Service (DSL, Fiber optic, No)
- Online Security (Yes, No, No internet service)
- Online Backup (Yes, No, No internet service)
- Device Protection (Yes, No, No internet service)
- Tech Support (Yes, No, No internet service)
- Streaming TV (Yes, No, No internet service)
- Streaming Movies (Yes, No, No internet service)
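All three files share the Customer ID column, which is what makes them joinable later in this tutorial. As a quick sanity check (and to see the kind of merge code that data virtualization spares you from writing), here is a minimal Python sketch that inner-joins CSV files on a shared key column. The file and column names are the ones used in this tutorial; the exact header spelling in your download may differ.

```python
import csv

def inner_join_csv(paths, key):
    """Inner-join several CSV files on a shared key column.

    Returns {key_value: merged_row_dict} for keys present in every file.
    Later files overwrite duplicate column names, mirroring a naive merge.
    """
    tables = []
    for path in paths:
        with open(path, newline="") as f:
            tables.append({row[key]: row for row in csv.DictReader(f)})
    common = set(tables[0])
    for table in tables[1:]:
        common &= set(table)          # keep only customers present in every file
    merged = {}
    for k in sorted(common):
        row = {}
        for table in tables:
            row.update(table[k])
        merged[k] = row
    return merged

# Hypothetical usage once the files are downloaded:
# rows = inner_join_csv(
#     ["billing.csv", "customer-service.csv", "products.csv"],
#     key="Customer ID")
```

This is exactly the bookkeeping (key matching, column merging, handling rows missing from one file) that the joined views in Step 7 handle for you.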
Step 3. Set up the project and provision data virtualization on IBM Cloud Pak for Data
Log in to IBM Cloud Pak for Data
Launch a browser and navigate to your IBM Cloud Pak for Data deployment.
Create a new IBM Cloud Pak for Data project
Go to the hamburger (☰) menu and click Projects.
Click on New project.
Select Create an empty project.
Provide a name and optional description for the project and click Create.
Provision data virtualization on IBM Cloud Pak for Data
Go to the Services tab. Under Data sources, choose the Data Virtualization tile. Click the three vertical dots and choose Deploy.
Follow the instructions to deploy data virtualization.
For deployment using Managed OpenShift®, you must do the following:
- IMPORTANT: Decide whether to check the Update the kernel semaphore parameter checkbox.
- IMPORTANT: Do NOT choose the defaults for storage. You must choose ibmc-file-gold-gid as the storage class.
Step 4. Set up the database
IBM Cloud Pak for Data can work with any database that has a JDBC connector. For this tutorial, we will be using IBM Db2 Warehouse on Cloud, a local IBM Db2 Warehouse, and Netezza Performance Server.
Set up Netezza Performance Server
Before you create a connection to IBM Netezza Performance Server (NPS), you should create the required tables and load the CSV data into the NPS server using the `nzload` CLI. To install the `nzload` CLI, follow the instructions.
Log in to your IBM NPS console and create three tables for billing, customer, and product. Note that the tables must exist before you load the data using `nzload`. Then use the `nzload` CLI to load the CSV data (downloaded in Step 1) into your NPS database:

```bash
nzload -u <user> -pw <password> -host <host> -db <database> -t <table name> -delim ',' -df <csv file name>
```

If the `nzload` CLI is not supported on your platform (for example, on macOS), you will have to generate INSERT statements for the CSV data and run them from the Netezza console. This can take a little longer than the `nzload` command.
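For the INSERT fallback, a small Python sketch like the following can generate the statements from a CSV file. This is a hypothetical helper, not part of the Netezza tooling; it quotes every value as a SQL string, so adjust it for numeric columns and for column names that contain spaces.

```python
import csv

def csv_to_inserts(path, table):
    """Yield one INSERT statement per data row of a CSV file.

    Every value is emitted as a quoted SQL string with embedded single
    quotes doubled; numeric columns would need the quotes removed.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        columns = next(reader)                      # header row
        column_list = ", ".join(columns)
        for row in reader:
            values = ", ".join("'" + v.replace("'", "''") + "'" for v in row)
            yield f"INSERT INTO {table} ({column_list}) VALUES ({values});"

# Hypothetical usage:
# for stmt in csv_to_inserts("billing.csv", "BILLING"):
#     print(stmt)   # paste or pipe these into the Netezza console
```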
Set up the Db2 Warehouse on IBM Cloud
We’ll need a place to store our data. It is suggested to use Db2 Warehouse on IBM Cloud to best conserve resources on the cluster. If you wish to use the local Db2 on the cluster, skip this section and set up the local Db2 warehouse on IBM Cloud Pak for Data instead.
NOTE: Ensure that you have provisioned Db2 warehouse on IBM Cloud before proceeding.
Get connection details for Db2 Warehouse on IBM Cloud.
Go to Service Credentials and click New credential +. Click the Copy to clipboard icon and save the credentials for later.
Now go to Manage and click Open Console.
Get SSL certificate for Db2 Warehouse on IBM Cloud
You will need an SSL cert for IBM Cloud Pak for Data to use the IBM Cloud Db2 Warehouse instance.
In the Db2 Warehouse console, from the upper-left hamburger (☰) menu, click CONNECTION INFO > Connection Information, then Download SSL Certificate.
You will need to convert the SSL certificate from .crt to a .pem file using OpenSSL. Run the following command:
```bash
openssl x509 -in DigiCertGlobalRootCA.crt -out DigiCertGlobalRootCA.pem -outform PEM -inform DER
```
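If you want to confirm the conversion worked before uploading the file, Python's standard `ssl` module can decode the PEM framing. This is only a quick sanity check of the file format, not a full certificate validation.

```python
import ssl

def pem_sanity_check(path):
    """Confirm a file contains a PEM-framed certificate.

    ssl.PEM_cert_to_DER_cert raises ValueError when the BEGIN/END
    CERTIFICATE markers are missing or the base64 body is malformed.
    Returns the size of the decoded DER certificate in bytes.
    """
    with open(path) as f:
        der = ssl.PEM_cert_to_DER_cert(f.read())
    return len(der)

# Hypothetical usage after running the openssl command above:
# pem_sanity_check("DigiCertGlobalRootCA.pem")
```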
Seed the Db2 Warehouse on IBM Cloud
From the upper-left hamburger (☰) menu, click LOAD > Load data.
Click browse files and select the billing.csv file that was downloaded earlier, then click Next.
Choose Schema NULLIDRA and click + New table. Under Create a new Table, provide `BILLING` as the name of the table and click Create, then Next. Accept the defaults and click Next. On the next screen, click Begin Load.
Repeat for the products.csv file, naming the table `PRODUCTS`, and for the customer-service.csv file, naming the table `CUSTOMERS`.
Now that the Db2 warehouse has been set up on IBM Cloud, you may now skip to add a new data source connection.
Step 5. Set up the local Db2 Warehouse on IBM Cloud Pak for Data
These instructions are for loading the data into the local IBM Cloud Pak for Data version of Db2 Warehouse. If you’ve used the IBM Cloud instance of Db2 Warehouse, you can skip to the next section.
It is suggested to use Db2 Warehouse on IBM Cloud to conserve resources on the CPD cluster as described in set up the Db2 warehouse on IBM Cloud. If you wish to use the local Db2 on the cluster, continue with this section.
Note that IBM Cloud Pak for Data can work with any database with a JDBC connector, so Db2 warehouse is only one of many choices.
NOTE: Ensure that you have provisioned the local Db2 warehouse on IBM Cloud Pak for Data before proceeding.
IMPORTANT: At least one OpenShift node must be labeled for db2wh (for example, `icp4data=database-db2wh`); otherwise, you will not be able to provision the local Db2 warehouse. The label can be set using the following command:

```bash
oc label node <NODE> icp4data=database-db2wh
```
Get connection details for local Db2 warehouse
To get the connection info for your local Db2 Warehouse, go to the hamburger (☰) menu and click on the My Instances option.
Now go to the Provisioned instances tab and click on the row for your local Db2 Warehouse. This will open the details of the Db2 warehouse instance.
Either keep this window open in a separate tab or copy the required connection information: Host, Port, Database name, Username, and Password. You can get the port from the JDBC Connection URL; for example, for the URL jdbc:db2://10.221.243.172:31956/BLUDB, the port is the number after the colon in the URL: 31956.
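If you prefer to pull the pieces out of the URL programmatically, a small sketch like this works for URLs of the shape shown above. The `jdbc:db2://HOST:PORT/DATABASE` shape is an assumption based on that example; other JDBC URL variants append properties after the database name and would need a looser pattern.

```python
import re

def parse_db2_jdbc_url(url):
    """Split a Db2 JDBC URL of the form jdbc:db2://HOST:PORT/DATABASE.

    Returns (host, port, database); raises ValueError for other shapes.
    """
    match = re.match(r"^jdbc:db2://([^:/]+):(\d+)/([^:;/]+)$", url)
    if match is None:
        raise ValueError(f"unrecognized Db2 JDBC URL: {url!r}")
    host, port, database = match.groups()
    return host, int(port), database
```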
Seed the local Db2 warehouse on IBM Cloud Pak for Data
Click the hamburger (☰) menu in the upper-left corner and choose Collect > My data.
Go to the Databases tab, click on the three vertical lines on the Db2 Warehouse tile and click Open database.
Under Summary, choose Load > Load Data.
Choose Browse files and select the billing.csv file downloaded earlier, then click Next.
Choose Schema NULLIDRA and click + New table. Under New Table Name, type `BILLING`, then click Create, then Next. Accept the defaults and click Next. On the next screen, click Begin Load.
Repeat for the products.csv file, naming the table `PRODUCTS`, and for the customer-service.csv file, naming the table `CUSTOMERS`.
Step 6. Add a new data source connection
IBM Cloud Pak for Data can work with many databases, such as IBM Db2 Warehouse on Cloud, a local IBM Db2, Netezza Performance Server, or MongoDB. In the following steps, you set up a connection to one of these and add it as a data source.
Get the connection information
If you are using a cloud database such as Db2 Warehouse on IBM Cloud or Netezza Performance Server, ensure that you have obtained the JDBC connection details and SSL certificate using the instructions provided as part of Step 4 above.
If you are using the local Db2 warehouse on IBM Cloud Pak for Data, ensure that you have obtained the JDBC connection details using the instructions provided as part of Step 5 above.
Add the new data source
To add a new data source, go to the hamburger (☰) menu and click on the Connections option.
At the overview, click New connection +.
Start by giving your new connection a name and selecting the connection type: choose Db2 Warehouse on Cloud for an IBM Db2 database, or PureData System for Analytics for Netezza Performance Server.
More fields should appear. Fill in the new fields with the credentials for your Db2 Warehouse connection (local or cloud). Click the checkbox for Use SSL.
NOTE: If you are using Db2 warehouse on IBM Cloud, click Select file and navigate to where you converted the SSL certificate for Db2 Warehouse from a .crt file to a .pem file (probably called DigiCertGlobalRootCA.pem). Click Test Connection, and after that succeeds, click Create.
The new connection will be listed in the overview.
Step 7. Virtualize data with data virtualization
NOTE: This section requires Admin user access to the IBM Cloud Pak for Data cluster.
For this section, we'll use the data virtualization tool to import data from a database that is exposed as a connection in IBM Cloud Pak for Data, such as IBM Db2 Warehouse, Netezza Performance Server, or MongoDB.
To launch the data virtualization tool, go to the hamburger (☰) menu and click Collect, then Data Virtualization.
At the empty overview, click the drop-down next to Add new data source and select From existing connections.
Select the data source we made in the previous step and click Next. The data source can be IBM Db2, Netezza (PureData System), MongoDB, or any other entry from the list.
The new connection will be listed as a data source for data virtualization.
Start virtualizing data
Now that we have access to the database data, whether in IBM Db2 Warehouse or Netezza Performance Server, we can virtualize it to our IBM Cloud Pak for Data project. Click on the Data Sources drop-down and choose Virtualize.
Several tables will appear in the list (many are sample tables created when a Db2 Warehouse instance is provisioned). Find the tables you created earlier; the previous instructions suggested naming them CUSTOMERS, PRODUCTS, and BILLING. Once they are selected, click Add to cart and then View Cart.
The next panel prompts the user to choose which project to assign the data to. Choose My virtualized data and uncheck the box that says Submit to catalog. Click Virtualize to start the process.
You’ll be notified that the virtual tables have been created. Let’s see the new virtualized data from the data virtualization tool by clicking View my virtualized data.
Join the virtualized data
Now we are going to join the tables we created so we have a merged set of data. It will be easier to do it here rather than in a notebook where we would have to write code to handle three different datasets. Click on any two tables (PRODUCTS and BILLING, for instance), then click the Join button.
To join the tables, we need to pick a key that is common to both datasets. Here we choose to map customerID from the first table to customerID on the second table. Do this by clicking one and dragging it to the other. When the line is drawn, click Next.
Next, you have a chance to edit column names, but we will keep them as-is. Click Next.
In the next panel, we'll give our joined data a unique name such as `BILLINGPRODUCTS` (to be consistent with SQL standards, pick an uppercase name). Under Assign to, choose My virtualized data and uncheck the box that says Submit to catalog. Click Create view to start the process.
You’ll be notified that the join has succeeded. Click on View my virtualized data to go back and see all your virtualized data.
IMPORTANT: Now join the new joined view (BILLINGPRODUCTS) and the last virtualized table (CUSTOMERS) to create a joined view that contains all three tables; let's call it `BILLINGPRODUCTSCUSTOMERS`. Switching back to the My virtualized data screen should show all three virtualized tables and two joined views. Do not go to the next section until this step is complete.
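Behind the scenes, a joined view like this is conceptually similar to a SQL view over the virtualized tables. The exact SQL that Data Virtualization generates may differ; the sketch below just builds an illustrative CREATE VIEW statement for the names used in this tutorial (a real view would list columns explicitly, which is what the column-rename step addresses).

```python
def join_view_sql(view, left, right, key="customerID"):
    """Build an illustrative CREATE VIEW statement for a two-table join.

    Selects all columns from both tables; a production view would
    enumerate columns so the join key appears only once.
    """
    return (
        f"CREATE VIEW {view} AS "
        f"SELECT * FROM {left} l "
        f'JOIN {right} r ON l."{key}" = r."{key}"'
    )

# Illustrative output for the first join in this tutorial:
# join_view_sql("BILLINGPRODUCTS", "BILLING", "PRODUCTS")
```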
Grant access to the virtualized data
For other users to have access to the data you just virtualized, you need to grant it. Follow these steps to make your virtualized data visible to them:
Go to Data Virtualization from the hamburger (☰) menu. Click on Menu > My virtualized data.
Click on the virtualized data you’ve created, then click the three vertical dots to the right, and choose Manage access.
Click the Specific users radio button, then Add user +.
Select the users you wish to grant access to and click Add users.
Repeat the above steps for the remaining tables and views.
Assign the Engineer role to the users
IBM Cloud Pak for Data users who need to use data virtualization functions must be assigned specific roles based on their job descriptions. These roles are Admin, Engineer, User, and Steward. You can learn more about these roles on the IBM Cloud Pak for Data product hub.
Let’s assign the Engineer role to some users:
From the hamburger (☰) menu, choose the Data Virtualization option, then click My virtualized data > User management.
Click on Add users + and update the role of your users to Engineer.
Step 8. Assign virtualized data to a project
Now let's look at how a user who has access to virtualized data can assign that data to their project, that is, how to add the virtualized data as an asset in a project.
Assign the data to your project
From the hamburger (☰) menu, click on Collect > Data Virtualization. You will be brought to the My virtualized data section. Here you should see the data you can access (or that the administrator has assigned to you). Select the checkbox next to our original tables (BILLING, PRODUCTS, CUSTOMERS) and the joined tables (BILLINGPRODUCTS, BILLINGPRODUCTSCUSTOMERS), and click the Assign button to import them into your project.
On the Assign virtual objects screen, choose the project to assign the data. If there is a Submit to catalog checkbox on the top right, uncheck it and click the Assign button to add the data to your project.
In the pop-up panel, you will receive a confirmation that the objects have been assigned to your project. Click the Go to project button. Alternatively, close the dialog and go to your projects by clicking the hamburger (☰) menu, then choosing Projects.
On the project page, clicking on the Assets tab will show the virtualized tables and joined tables that are now in your project.
Summary
This tutorial explained how to virtualize Db2 Warehouse data with data virtualization on IBM Cloud Pak for Data to make queries across multiple data sources. This tutorial is part of the Getting started with IBM Cloud Pak for Data learning path. To continue the series and learn more about IBM Cloud Pak for Data, take a look at the next tutorial, Data visualization with Data Refinery.