Think 2021: The developer ecosystem and IBM are building together Learn more

Query across distributed data sources as one: Data virtualization for data analytics

Introduction

This tutorial is part of a series that explores telecom call-drop predictions using IBM Cloud Pak® for Data, data virtualization, IBM Watson® OpenScale™, and Cognos® Analyics.

Level Topic Type
201 Query across distributed data sources as one: Data virtualization for data analytics Tutorial
201 Monitor your machine learning models using Watson OpenScale in IBM Cloud Pak for Data Pattern
301 Build dashboards in Cognos Analytics on IBM Cloud Pak for Data Tutorial
301 Predict, manage, and monitor the call drops of cell towers using IBM Cloud Pak for Data Pattern

There’s a growing number of large-scale enterprises that need to perform targeted data analytics on multiple data sources. The integration of data from multiple sources is a challenging task. Data virtualization presents a modern approach to data integration. Unlike ETL solutions that replicate data, data virtualization leaves the data in source systems, simply exposing an integrated view of all the data to data consumers. As business users drill down into reports, data virtualization fetches the data in real time from the underlying source systems. Data virtualization proves that connecting to data is far superior to collecting it.

Data can reside in multiple data marts. And data virtualization, within IBM Cloud Pak for Data, can perform data integration seamlessly. Data virtualization can connect to data wherever it resides and provides the ability to view, access, manipulate and analyze data without the need to know or understand its physical format and location. Data virtualization creates virtual tables to join data from different data sources, then allows running queries against the resulting virtual table.

In this tutorial, we’ll demonstrate how data virtualization can be achieved on various data sources with an Oracle database hosted on Amazon Web Services with IBM Cloud Db2® Warehouse and IBM Db2 on Cloud. You can also connect to Netezza Performance Server by selecting from the menu.

Learning objectives

In this tutorial, you’ll learn:

  • The concept of data virtualization
  • How to create connections from databases hosted on multiple environments, including Amazon Web Services, Azure, IBM Cloud, Netezza Performance Server, and more
  • How to create views from Joins and publish data to your current project

Prerequisites

Estimated time

Completing this tutorial should take about 30 minutes.

Steps

Step 1. Add data connection to your IBM Cloud Pak for Data instance

  1. On your IBM Cloud Pak for Data instance, navigate to the Hamburger menu (≡), expand Data, and select Data virtualization.

  2. Click Add data source +, then select Create new connection to add a database connection to your IBM Cloud Pak for Data instance. Add data connection

Step 2. Add the database connection as a data source in IBM Cloud Pak for Data

  1. Select the type of connection (in this tutorial, we demonstrate using an Oracle database on AWS and Db2 on IBM Cloud).

  2. Provide a name for the connection and enter the necessary connection information for the database: database name, host, port, username, and password. Click Create.

  3. Click Skip to skip adding to a remote connector. The database connection will now be displayed on the screen under Data Sources. Add db connection details

  4. Repeat the same steps for the other database.

Step 3. Virtualize your data

  1. Expand the Data Virtualization menu, and select Virtualize.

  2. The tables available in the data sources added to Data Virtualization will be listed on the screen. The tables can be filtered based on their source, Oracle or IBM database, for example. Check the box against the table name, and click Add to Cart. Repeat for all of the tables that you want to virtualize. Select table

  3. After you have completed adding the tables to the cart, click View Cart. Assign the tables to your current working project, and click Virtualize. Assign tables

Step 4. Merge the virtualized data

  1. After you have virtualized data, click View my virtualized data.

  2. Your data will be added to the My virtualized data section, where all of your local/virtualized data will reside.

  3. Select the desired tables from the virtualized list by checking the box against the names of the tables, then click Join. Merge data

  4. Next, you will be a given a view of the schema of two tables. You can drag and drop to represent the Primary Key and Foreign Key relationship. Click Preview to see how it has joined. Key relationship

Note:

  1. Make sure you drag Primary Key to Foreign Key (only this direction) to make the join work.
  2. If you want to combine more than two tables or do a different kind of join, click Open in SQL editor and write your SQL query. The only requirement is that the data being referenced in the query must be virtualized.

Step 5. Assign the view to the internal database and a project

  1. Click Next.

  2. Edit the names of the columns if you want, then click Next.

  3. Provide a name for the view and select the internal database schema within which you want the view to be published.

  4. Assign the created view to your project by selecting the Project radio button and then choosing your project’s name in the drop-down list. Click Create view. Assign project

  5. To access the virtualized table, choose Service settings in the Data virtualization menu. You can use the credentials here to access all of the merged and virtualized views in your IBM Cloud Pak for Data instance. Service details

Note: Before assigning to a project, make sure you have a new project created.

Troubleshooting

  • If a user is unable to access data virtualization, log in as the Admin, and go to User management in the Data Virtualization menu.

  • Click Add users + to add Cloud Pak for Data users as Data Virtualization users. Add users

  • If you are unable to preview your join or getting zero rows of data, go to Service settings under the Data Virtualization menu, and click Disable in front of Restrict visibility. Disable button

Summary

The typical flow of a data science project begins with aggregating and performing data mining on the data from various sources. This tutorial shows you to virtualize data for your data science project and continue the next steps. The main advantage of the data virtualization capabilities of IBM Cloud Pak for Data is the ability to seamlessly join databases from any data source, providing an easy integration platform.

Continue this series with a code pattern titled Monitor your machine learning models using Watson OpenScale in IBM Cloud Pak for Data.