Query across distributed data sources as one: Data virtualization for data analytics
Data virtualization using IBM Cloud Pak for Data
This tutorial is part of a series that explores telecom call-drop predictions using IBM Cloud Pak for Data, data virtualization, Watson OpenScale, and Cognos Analyics.
There’s a growing number of large-scale enterprises that need to perform targeted data analytics on multiple data sources. The integration of data from multiple sources is a challenging task. Data virtualization presents a modern approach to data integration. Unlike ETL solutions that replicate data, data virtualization leaves the data in source systems, simply exposing an integrated view of all the data to data consumers. As business users drill down into reports, data virtualization fetches the data in real time from the underlying source systems. Data virtualization proves that connecting to data is far superior to collecting it.
Data can reside in multiple data marts. And data virtualization, within IBM Cloud Pak for Data, can perform data integration seamlessly. Data virtualization can connect to data wherever it resides and provides the ability to view, access, manipulate and analyze data without the need to know or understand its physical format and location. Data virtualization creates virtual tables to join data from different data sources, then allows running queries against the resulting virtual table.
In this Tutorial, we’ll demonstrate how data virtualization can be achieved on various data sources with an Oracle Database hosted on Amazon Web Services with IBM Cloud Db2 Warehouse and IBM Db2 on Cloud.
After completing this tutorial, you’ll learn:
- The concept of data virtualization.
- How to create connections from databases hosted on multiple environments, including Amazon Web Services, Azure, IBM Cloud, or on-premise.
- How to create views from Joins and publish data to your current project.
Two or more Database sources. Note: For this tutorial, we’ll demonstrate with:
Completing this tutorial should take about 30 minutes.
Step 1: Add data connection to your IBM Cloud Pak for Data instance
- From your IBM Cloud Pak for Data instance, go to the
Virtualize Datatab. Click on the top left
navigation menu > Collect Data > Virtualize Data
- Click on
Addand Add a new db Connection to your IBM Cloud Pak for Data instance.
Step 2: Add the database connection as a data source in IBM Cloud Pak for Data
- Enter the desired connection name, and select the
Connection Type. In this Tutorial, we demonstrate using an Oracle database on AWS and Db2 on IBM Cloud.
- Once you have selected the database, enter the necessary connection information for the db- Host, Port, Username, Password.
- Enter the db credentials
Test Connection. After it is successful,
Repeat the same steps for another database. In this case, Db2 on Cloud.
Step 3: Virtualize your data
- Select the tables which you wish to Virtualize. The tables will be filtered based on their source. For example, Oracle db or IBM Db. Check the box against the table name and click on
Add to Cart.
- Once you have completed adding the tables to the cart, Click on
View Cart. Assign the tables to your current working project and click on
Step 4: Merge the virtualized data
- Once you have virtualized data, click on
View My Data.
- Your data will be added to the
My Datasection, where all your local/virtualized data will reside.
- Select the desired tables from the virtualized list, by checking the box against the names of the tables. Then, click on
- Next, you will be a given a view of the schema of 2 tables. You can drag and drop to represent the Primary Key and Foreign Key relationship. Then Click on
Previewto see how it has joined.
- Make sure you drag Primary Key to Foreign Key (only this direction), in order to make the join work.
- If you wish to combine more than 2 tables or do a different kind of join. Click on the
SQL Editorand write your SQL Query. The only requirement is the data being referenced in the query must be virtualized.
Step 5: Assign the view to the internal database and a project
- After viewing the
Preview, you can save the view created by clicking on
- Give a name to the view created and click on
nextand select the internal database schema you want the view to be published in.
- Assign the created view to your project and click on
- To access the Created Virtualized Table, go to
Menu > Service Details. You can use the credentials in this page to access all the merged and virtualized views in your IBM Cloud Pak for Data instance.
Note: Before assigning to project, make sure you have a new project created.
- If the user login is unable to access data virtualization: Go to
Menu > Manage Users.
- Click on
Add Userand select your current user login details.
- If you are unable to preview your join or getting zero rows of data
- Go to
Menu > Service Details
- Under the
Virtual object access controlsection, click on
The typical flow of a data science project begins with aggregating and performing mining on the data from various sources. This tutorial allows you to seamlessly virtualize data for your data science project and continue the next steps. The main advantage of the data virtualization capabilities of IBM Cloud Pak for Data is the ability to seamlessly join databases from any data source, providing an easy integration platform.