Query across distributed data sources as one: Data virtualization for data analytics

Introduction

This tutorial is part of a series that explores telecom call-drop predictions using IBM Cloud Pak® for Data, data virtualization, IBM Watson® OpenScale™, and Cognos® Analyics.

Level Topic Type
201 Query across distributed data sources as one: Data virtualization for data analytics Tutorial
201 Monitor your machine learning models using Watson OpenScale in IBM Cloud Pak for Data Pattern
301 Build dashboards in Cognos Analytics on IBM Cloud Pak for Data Tutorial
301 Predict, manage, and monitor the call drops of cell towers using IBM Cloud Pak for Data Pattern

There’s a growing number of large-scale enterprises that need to perform targeted data analytics on multiple data sources. The integration of data from multiple sources is a challenging task. Data virtualization presents a modern approach to data integration. Unlike ETL solutions that replicate data, data virtualization leaves the data in source systems, simply exposing an integrated view of all the data to data consumers. As business users drill down into reports, data virtualization fetches the data in real time from the underlying source systems. Data virtualization proves that connecting to data is far superior to collecting it.

Data can reside in multiple data marts. And data virtualization, within IBM Cloud Pak for Data, can perform data integration seamlessly. Data virtualization can connect to data wherever it resides and provides the ability to view, access, manipulate and analyze data without the need to know or understand its physical format and location. Data virtualization creates virtual tables to join data from different data sources, then allows running queries against the resulting virtual table.

In this tutorial, we’ll demonstrate how data virtualization can be achieved on various data sources with an Oracle database hosted on Amazon Web Services with IBM Cloud Db2® Warehouse and IBM Db2 on Cloud. You can also connect to Netezza Performance Server by selecting from the menu.

Learning objectives

In this tutorial, you’ll learn:

  • The concept of data virtualization
  • How to create connections from databases hosted on multiple environments, including Amazon Web Services, Azure, IBM Cloud, Netezza Performance Server, and more
  • How to create views from Joins and publish data to your current project

Prerequisites

Estimated time

Completing this tutorial should take about 30 minutes.

Steps

Step 1. Add data connection to your IBM Cloud Pak for Data instance

  1. From your IBM Cloud Pak for Data instance, go to the Virtualize Data tab. From the top-left navigation menu, click Collect Data > Virtualize Data.

  2. Click Add to add a database connection to your IBM Cloud Pak for Data instance. Add data connection

Step 2. Add the database connection as a data source in IBM Cloud Pak for Data

  1. Enter the desired connection name and select Connection Type. In this tutorial, we demonstrate using an Oracle database on AWS and Db2 on IBM Cloud. Add connection

  2. Once you’ve selected the database, enter the necessary connection information for the database: host, port, username, password. Enter details

  3. Enter the database credentials. Enter credentials

  4. Click Test Connection, then Save.

  5. Repeat the same steps for another database — in this case, Db2 on Cloud.

Step 3. Virtualize your data

  1. Select the tables you wish to virtualize. The tables will be filtered based on their source — Oracle or IBM database, for example. Check the box against the table name and click Add to Cart. Select table

  2. Once you have completed adding the tables to the cart, click View Cart. Assign the tables to your current working project and click Virtualize. Assign tables

Step 4. Merge the virtualized data

  1. Once you have virtualized data, click View My Data.

  2. Your data will be added to the My Data section, where all your local/virtualized data will reside.

  3. Select the desired tables from the virtualized list by checking the box against the names of the tables, then click Join View. Merge data

  4. Next, you will be a given a view of the schema of two tables. You can drag and drop to represent the Primary Key and Foreign Key relationship. Click Preview to see how it has joined. Key relationship

Note:

  1. Make sure you drag Primary Key to Foreign Key (only this direction), in order to make the join work.
  2. If you wish to combine more than two tables or do a different kind of join, click SQL Editor and write your SQL query. The only requirement is the data being referenced in the query must be virtualized.

Step 5. Assign the view to the internal database and a project

  1. After previewing, you can save the view created by clicking Join.

  2. Give a name to the view created, click Next, and select the internal database schema you want the view to be published in.

  3. Assign the created view to your project and click Create View. Assign project

  4. To access the virtualized table, go to Menu > Service Details. You can use the credentials here to access all the merged and virtualized views in your IBM Cloud Pak for Data instance. Service detials

Note: Before assigning to project, make sure you have a new project created.

Troubleshooting

  • If the user login is unable to access data virtualization, go to Menu > Manage Users. Manage users

  • Click Add User and select your current user login details. Add users

  • If you are unable to preview your join or getting zero rows of data, go to Menu > Service Details and under Virtual object access control, click Disable. Disable button

Summary

The typical flow of a data science project begins with aggregating and performing data mining on the data from various sources. This tutorial shows you to virtualize data for your data science project and continue the next steps. The main advantage of the data virtualization capabilities of IBM Cloud Pak for Data is the ability to seamlessly join databases from any data source, providing an easy integration platform.

Continue this series with a code pattern titled Monitor your machine learning models using Watson OpenScale in IBM Cloud Pak for Data.