Kubernetes with OpenShift World Tour: Get hands-on experience and build applications fast! Find a workshop!

Query across distributed data sources as one: Data virtualization for data analytics

Introduction

This tutorial is part of a series that explores telecom call-drop predictions using IBM Cloud Pak for Data, Data Virtualization, Watson OpenScale, and Cognos Analyics.

There’s a growing number of large-scale enterprises that need to perform targeted data analytics on multiple data sources. The integration of data from multiple sources is a challenging task. Data virtualization presents a modern approach to data integration. Unlike ETL solutions that replicate data, data virtualization leaves the data in source systems, simply exposing an integrated view of all the data to data consumers. As business users drill down into reports, data virtualization fetches the data in real time from the underlying source systems. Data virtualization proves that connecting to data is far superior to collecting it.

Data can reside in multiple data marts. And data virtualization, within IBM Cloud Pak for Data, can perform data integration seamlessly. Data virtualization can connect to data wherever it resides and provides the ability to view, access, manipulate and analyze data without the need to know or understand its physical format and location. Data virtualization creates virtual tables to join data from different data sources, then allows running queries against the resulting virtual table.

In this Tutorial, we’ll demonstrate how data virtualization can be achieved on various data sources with an Oracle Database hosted on Amazon Web Services with IBM Cloud Db2 Warehouse and IBM Db2 on Cloud.

Learning objectives

After completing this tutorial, you’ll learn:

  1. The concept of data virtualization.
  2. How to create connections from databases hosted on multiple environments, including Amazon Web Services, Azure, IBM Cloud, or on-premise.
  3. How to create views from Joins and publish data to your current project.

Prerequisites

Estimated time

Completing this tutorial should take about 30 minutes.

Steps

Step 1: Add data connection to your IBM Cloud Pak for Data instance

  • From your IBM Cloud Pak for Data instance, go to the Virtualize Data tab. Click on the top left navigation menu > Collect Data > Virtualize Data
  • Click on Add and Add a new db Connection to your IBM Cloud Pak for Data instance.

Add data connection image

Step 2: Add the database connection as a data source in IBM Cloud Pak for Data

  • Enter the desired connection name, and select the Connection Type. In this Tutorial, we demonstrate using an Oracle database on AWS and Db2 on IBM Cloud.

Add Db connection image

  • Once you have selected the database, enter the necessary connection information for the db- Host, Port, Username, Password.

Enter db details image

  • Enter the db credentials

Enter db credentials image

  • Click on Test Connection. After it is successful, Save the connection.

  • Repeat the same steps for another database. In this case, Db2 on Cloud.

Step 3: Virtualize your data

  • Select the tables which you wish to Virtualize. The tables will be filtered based on their source. For example, Oracle db or IBM Db. Check the box against the table name and click on Add to Cart.

Select table image

  • Once you have completed adding the tables to the cart, Click on View Cart. Assign the tables to your current working project and click on Virtualize.

Assign tables

Step 4: Merge the virtualized data

  • Once you have virtualized data, click on View My Data.
  • Your data will be added to the My Data section, where all your local/virtualized data will reside.
  • Select the desired tables from the virtualized list, by checking the box against the names of the tables. Then, click on Join View.

Merge data image

  • Next, you will be a given a view of the schema of 2 tables. You can drag and drop to represent the Primary Key and Foreign Key relationship. Then Click on Preview to see how it has joined.

Key relationship image

Note:

  1. Make sure you drag Primary Key to Foreign Key (only this direction), in order to make the join work.
  2. If you wish to combine more than 2 tables or do a different kind of join. Click on the SQL Editor and write your SQL Query. The only requirement is the data being referenced in the query must be virtualized.

Step 5: Assign the view to the internal database and a project

  • After viewing the Preview, you can save the view created by clicking on Join.
  • Give a name to the view created and click on next and select the internal database schema you want the view to be published in.
  • Assign the created view to your project and click on Create View.

Assign project image

  • To access the Created Virtualized Table, go to Menu > Service Details. You can use the credentials in this page to access all the merged and virtualized views in your IBM Cloud Pak for Data instance.

Service detials image

Note: Before assigning to project, make sure you have a new project created.

Troubleshooting

  • If the user login is unable to access data virtualization: Go to Menu > Manage Users.

Manage users image

  • Click on Add User and select your current user login details.

Add users image

  • If you are unable to preview your join or getting zero rows of data
  • Go to Menu > Service Details
  • Under the Virtual object access control section, click on Disable.

Disable button image

Summary

The typical flow of a data science project begins with aggregating and performing mining on the data from various sources. This tutorial allows you to seamlessly virtualize data for your data science project and continue the next steps. The main advantage of the data virtualization capabilities of IBM Cloud Pak for Data is the ability to seamlessly join databases from any data source, providing an easy integration platform.

Smruthi Raj Mohan
Srikanth Manne