This tutorial is part of a series that explores telecom call-drop predictions using IBM Cloud Pak® for Data, data virtualization, IBM Watson® OpenScale™, and Cognos® Analyics.
|201||Query across distributed data sources as one: Data virtualization for data analytics||Tutorial|
|201||Monitor your machine learning models using Watson OpenScale in IBM Cloud Pak for Data||Pattern|
|301||Build dashboards in Cognos Analytics on IBM Cloud Pak for Data||Tutorial|
|301||Predict, manage, and monitor the call drops of cell towers using IBM Cloud Pak for Data||Pattern|
There’s a growing number of large-scale enterprises that need to perform targeted data analytics on multiple data sources. The integration of data from multiple sources is a challenging task. Data virtualization presents a modern approach to data integration. Unlike ETL solutions that replicate data, data virtualization leaves the data in source systems, simply exposing an integrated view of all the data to data consumers. As business users drill down into reports, data virtualization fetches the data in real time from the underlying source systems. Data virtualization proves that connecting to data is far superior to collecting it.
Data can reside in multiple data marts. And data virtualization, within IBM Cloud Pak for Data, can perform data integration seamlessly. Data virtualization can connect to data wherever it resides and provides the ability to view, access, manipulate and analyze data without the need to know or understand its physical format and location. Data virtualization creates virtual tables to join data from different data sources, then allows running queries against the resulting virtual table.
In this tutorial, we’ll demonstrate how data virtualization can be achieved on various data sources with an Oracle database hosted on Amazon Web Services with IBM Cloud Db2® Warehouse and IBM Db2 on Cloud. You can also connect to Netezza Performance Server by selecting from the menu.
In this tutorial, you’ll learn:
- The concept of data virtualization
- How to create connections from databases hosted on multiple environments, including Amazon Web Services, Azure, IBM Cloud, Netezza Performance Server, and more
- How to create views from Joins and publish data to your current project
- IBM Cloud Pak for Data with the Data Virtualization add-on enabled.
- Two or more database sources. Note: For this tutorial, we’ll demonstrate with:
Completing this tutorial should take about 30 minutes.
Step 1. Add data connection to your IBM Cloud Pak for Data instance
From your IBM Cloud Pak for Data instance, go to the Virtualize Data tab. From the top-left navigation menu, click Collect Data > Virtualize Data.
Click Add to add a database connection to your IBM Cloud Pak for Data instance.
Step 2. Add the database connection as a data source in IBM Cloud Pak for Data
Enter the desired connection name and select Connection Type. In this tutorial, we demonstrate using an Oracle database on AWS and Db2 on IBM Cloud.
Once you’ve selected the database, enter the necessary connection information for the database: host, port, username, password.
Enter the database credentials.
Click Test Connection, then Save.
Repeat the same steps for another database — in this case, Db2 on Cloud.
Step 3. Virtualize your data
Select the tables you wish to virtualize. The tables will be filtered based on their source — Oracle or IBM database, for example. Check the box against the table name and click Add to Cart.
Once you have completed adding the tables to the cart, click View Cart. Assign the tables to your current working project and click Virtualize.
Step 4. Merge the virtualized data
Once you have virtualized data, click View My Data.
Your data will be added to the My Data section, where all your local/virtualized data will reside.
Select the desired tables from the virtualized list by checking the box against the names of the tables, then click Join View.
Next, you will be a given a view of the schema of two tables. You can drag and drop to represent the Primary Key and Foreign Key relationship. Click Preview to see how it has joined.
- Make sure you drag Primary Key to Foreign Key (only this direction), in order to make the join work.
- If you wish to combine more than two tables or do a different kind of join, click SQL Editor and write your SQL query. The only requirement is the data being referenced in the query must be virtualized.
Step 5. Assign the view to the internal database and a project
After previewing, you can save the view created by clicking Join.
Give a name to the view created, click Next, and select the internal database schema you want the view to be published in.
Assign the created view to your project and click Create View.
To access the virtualized table, go to Menu > Service Details. You can use the credentials here to access all the merged and virtualized views in your IBM Cloud Pak for Data instance.
Note: Before assigning to project, make sure you have a new project created.
If the user login is unable to access data virtualization, go to Menu > Manage Users.
Click Add User and select your current user login details.
If you are unable to preview your join or getting zero rows of data, go to Menu > Service Details and under Virtual object access control, click Disable.
The typical flow of a data science project begins with aggregating and performing data mining on the data from various sources. This tutorial shows you to virtualize data for your data science project and continue the next steps. The main advantage of the data virtualization capabilities of IBM Cloud Pak for Data is the ability to seamlessly join databases from any data source, providing an easy integration platform.
Continue this series with a code pattern titled Monitor your machine learning models using Watson OpenScale in IBM Cloud Pak for Data.