This tutorial highlights the ability of IBM Cloud Pak for Data to integrate with other market-leading data management platforms, such as Cloudera Data Platform. Due to the complexity of installing Cloudera Data Platform and IBM Cloud Pak for Data, it is assumed you already have access to these environments. If not, you can still follow along and get insights into how integration can be accomplished.
View the following steps as a video or follow along step by step.
NOTE: Database table and schema names may differ between the video and the instructions listed below, but the steps performed should be similar.
Learn to use IBM Cloud Pak for Data to integrate with other data management platforms, such as Cloudera Data Platform.
- IBM Cloud Pak for Data
- Cloudera Data Platform
Completing this tutorial should take about 45 minutes.
Overview of technologies
Let’s start with an overview of the technologies that we will be using.
Cloudera Data Platform
Cloudera Data Platform (CDP Private Cloud) is built for hybrid cloud, connecting on-premises environments to public clouds. It provides a disaggregation of compute and storage, allowing for independent scaling of compute and storage clusters. Analytics run on containerized compute nodes, scalable object store, and a secure data lake.
- Cloudera Manager provides administration of your Cloudera cluster, including operations for installation, upgrading, and host management and monitoring.
- Apache Hive runs over the Hadoop framework and provides an SQL-like interface for processing and querying the HDFS data.
- Apache Impala is similar to Hive, but its low latency and high concurrency make it a better option for interactive queries.
- Apache Knox is an application gateway for interacting with the REST APIs and UIs of Hadoop deployments. Knox presents consumers with one endpoint for access to all the required services across multiple Hadoop clusters.
IBM Cloud Pak for Data
IBM Cloud Pak for Data is a unified, pre-integrated data and AI platform that runs natively on the Red Hat OpenShift Container Platform. Services are delivered with an open and extensible cloud native platform for collecting, organizing, and analyzing data. It’s a single interface to perform end-to-end analytics with built-in governance. It also supports and governs the end-to-end AI workflow.
- IBM Db2 is a relational database management system (RDBMS). Along with providing the Db2 relational database, it includes a family of tools that allows you to manage both structured and unstructured data across on-premises and multi-cloud environments.
- IBM Big SQL provides a single database connection to query data across Hadoop and other relational/NoSQL databases. Data can reside on local systems or on the cloud.
- Data Virtualization can query data across many systems without having to copy and replicate data. It’s accurate because you’re querying the latest data at its source.
- IBM Cognos Analytics Dashboard offers self-service analytics, infused with AI and machine learning, enabling you to create stunning visualizations and share your findings through dashboards and reports.
Note: We have tested CDP versions 7.1.6 and 7.1.7 and IBM Cloud Pak for Data 3.5.2 and 4.0.1.
We provisioned eight VMs running CentOS 7.9 on IBM Cloud to host our CDP instance.
We also provisioned two additional virtual servers: one for a bastion node so we could easily SSH into all the other nodes, and the other for an Active Directory server to manage authentication for our cluster.
The following image shows the nodes listed in our IBM Cloud devices list.
- `cid-bastion` is our bastion
- `cid-adc` is our Active Directory server
- `cid-vm-01` to `cid-vm-03` are our master nodes
- `cid-vm-04` to `cid-vm-06` are our worker nodes
- `cid-vm-07` and `cid-vm-08` are our edge nodes
Windows 2019 Server
We set up a Windows 2019 Server (`cid-adc`) to take advantage of its Active Directory service for user management and its domain services to manage DNS. We created a private domain (`cdplab.local`) so most of the traffic would stay on private networks.
We manage our CDP cluster using Cloudera Manager running on `cid-vm-01`. Below, you see our `cpdlab` cluster and its list of hosts.
The services available on the cluster are shown below.
IBM Cloud Pak for Data configuration
IBM Cloud Pak for Data comes pre-packaged with a host of tools and services that can be instantiated. We provisioned Data Virtualization, Db2, Big SQL, and Cognos Analytics.
Step 1. Load data on IBM Cloud Pak for Data
In IBM Cloud Pak for Data, we will add data to the Db2 instance. The Sales.csv data is meant to represent transactions and contains profit, revenue, sales price, and more.
The Db2 service has been provisioned in our IBM Cloud Pak for Data instance. To view it, from the main IBM Cloud Pak for Data menu, click Services > Instances.
Click on the Db2 instance name to bring up the details panel.
Click Open database in the top-right corner, then from the Db2 instance Summary panel, click Load Data.
From the File selection window, upload the Sales.csv file, then click Next.
To create a new sales table, first create a new `GREAT_OUTDOORS` schema, click New table +, enter `SALES` as the new table name, and click Create.
Once the table is created, click Next.
You can then view all of the data rows that will be uploaded into our new `SALES` table. Note that you can toggle the Header in first row button to view the column header names.
Click Next to start loading the data. If successful, there should be no error messages.
To view the data, from the Db2 summary panel, click Explore > Tables.
Step 2. Create a JDBC connection to Db2
After loading the data, we need to create a new connection to access it. Note that this task requires you to log in as `admin` on your IBM Cloud Pak for Data console.
From the main menu, select Data > Platform connections.
From the Platform connections panel, click New connection.
From the New connection panel, in the IBM list of connection options, select Db2.
To fill in the data for the new connection, take the values from the following two panels:
- Panel 1: The Db2 instance panel, which you can navigate to by selecting Services > Instances from the main menu on the IBM Cloud Pak for Data main screen, then clicking on the Db2 instance name.
- Panel 2: From the previous details screen, click Open database, then select Connection Information from the Summary drop-down menu.
Transfer the data from these panels into the new connection panel and click Test to ensure that it works.
- Enter a unique name in the Name field.
- Database name can be found in panel 1.
- Hostname can be found in panel 2.
- Port can be parsed from the JDBC Connection URL (SSL) found in panel 1.
- Username and password are your IBM Cloud Pak for Data credentials.
- The SSL certificate can be downloaded from panel 1.
Click Create to create the connection.
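For reference, the connection details collected above combine into a standard Db2 JDBC URL. A minimal sketch of that mapping; the host, port, and database name below are hypothetical placeholders, so take the real values from the panels described above:

```python
def db2_jdbc_url(host: str, port: int, database: str, ssl: bool = True) -> str:
    """Build a Db2 JDBC URL in the form Cloud Pak for Data displays."""
    url = f"jdbc:db2://{host}:{port}/{database}"
    if ssl:
        # SSL-enabled Db2 connections append this property to the URL.
        url += ":sslConnection=true;"
    return url

# Hypothetical values for illustration only.
print(db2_jdbc_url("db2-host.cdplab.local", 50001, "BLUDB"))
```

Conversely, the Port value asked for in the connection form can be read back out of the JDBC Connection URL (SSL) shown in panel 1.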
Step 3. Load data on Cloudera Data Platform
On the Cloudera side, we will use HDFS to populate our Hive database. This step uses the data set Products.csv, which contains information about products (brand, description, price, ID, etc.).
The first step is to add the data to a user’s HDFS directory. Download the data where appropriate and run the command below (replace the last value with a directory you have access to):
hdfs dfs -put Products.csv /user/<your-user-name>
Next, we connect to Hive using the Beeline CLI tool. To do this, download the Hive on Tez configuration file by selecting the Download Client Configuration option listed under the Actions drop-down menu for our Cloudera Hive on Tez service.
Unzip the downloaded file and open the `beeline-site.xml` file to find the URL associated with the property `beeline-hs2-jdbc-url-hive_on_tez`. Copy this value and exit the file.
Log into one of the nodes in your Cloudera Data Platform cluster and start the Beeline CLI by typing `beeline`.

NOTE: The authentication process will depend on how you configured your CDP Private Cloud instance.
Connect to the Hive service by typing the command below, replacing the `<jdbc-url>` value with the value copied from the previous step:

!connect <jdbc-url>
Use the following SQL command to create a new database and switch context to it:
CREATE DATABASE great_outdoors;
USE great_outdoors;
Use the following SQL command to create a new table for the products:
CREATE TABLE products (
  `Product_number` INT,
  `Product_line` STRING,
  `Product_type` STRING,
  `Product` STRING,
  `Introduction_date` STRING,
  `Product_brand` STRING,
  `Product_color` STRING,
  `Product_size` STRING,
  `Product_description` STRING,
  `Unit_cost` STRING,
  `Unit_price` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ('transactional'='false');
Use the following SQL command to load the data from HDFS to the table:
LOAD DATA INPATH '/<user-dir>/Products.csv' INTO TABLE products;
If the data is loaded correctly, you can query the data by running the SQL command below:
SELECT * from products;
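Because the table uses simple comma-delimited parsing (`fields terminated by ','`), commas embedded in field values will silently shift columns. A quick local sanity check that each row of the CSV has the expected 11 columns can save debugging later; this sketch uses a made-up sample row, not the real Products.csv contents:

```python
import csv
import io

# Stand-in for a Products-style CSV row; real values may differ.
sample = io.StringIO(
    "1,Camping Equipment,Tents,Trail Tent,2021-01-15,TrailCo,Green,4-person,"
    "Lightweight tent,120.00,199.00\n"
)

for row in csv.reader(sample):
    # The Hive table defines exactly 11 columns.
    assert len(row) == 11, f"expected 11 columns, got {len(row)}"
    print(row[0], row[3])
```

For a real file, replace the `io.StringIO` stand-in with `open("Products.csv", newline="")`.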
Step 4. Use IBM Big SQL to synchronize data from Hive table into IBM Cloud Pak for Data
At this point we’ll use Big SQL to synchronize data from Hive to Db2, so make sure you have the service installed on your IBM Cloud Pak for Data cluster.
Open the Big SQL service by clicking Open in the Action (⋮) menu.
From the Summary drop-down, click the Run SQL option.
Click Create new +, enter the following SQL command, and click Run all. The arguments are the Hive schema name, the table name, and 't' to indicate that the object being synchronized is a table:
CALL SYSHADOOP.HCAT_SYNC_OBJECTS('GREAT_OUTDOORS', 'PRODUCTS', 't');
From the drop-down menu, click Explore > Hadoop tables. Verify that the product data shows up when going to the `GREAT_OUTDOORS` schema and viewing the `PRODUCTS` table.
Create a JDBC connection to Big SQL
From the main menu, select Data > Platform connections.
From the Platform connections panel, click New connection.
From the New connection panel, in the IBM list of connection options, select Db2 Big SQL. The following image shows an example Big SQL connection that you can use as a starting point for your own.
NOTE: To determine host name, follow the procedures outlined in the Setting up a connection to Db2 Big SQL instructions. We used the “Passthrough secure route method.” The result was a route we could use for the Hostname or IP Address field.
For Username and Password, use your Cloudera Data Platform credentials.
For SSL certificate, you must run an `openssl` command against the hostname just generated:
openssl s_client -showcerts -servername <hostname without port> -connect <hostname:port>
This will return a certificate chain, so you will need to copy both certificates and paste them into the SSL certificate field. Be sure to remove any extraneous lines outside of the BEGIN CERTIFICATE and END CERTIFICATE delimiters.
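If you'd rather not trim the handshake output by hand, the certificate blocks can be extracted programmatically. A sketch in Python; the `handshake` string here is a made-up stand-in for the real `openssl s_client` output:

```python
import re

# Stand-in for the output of:
#   openssl s_client -showcerts -servername <host> -connect <host>:<port>
handshake = """\
depth=1 CN = Example Intermediate CA
-----BEGIN CERTIFICATE-----
MIIB...server-cert-base64...
-----END CERTIFICATE-----
verify return:1
-----BEGIN CERTIFICATE-----
MIIB...intermediate-cert-base64...
-----END CERTIFICATE-----
DONE
"""

# Keep only the text between (and including) the certificate delimiters.
certs = re.findall(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    handshake,
    re.DOTALL,
)
chain = "\n".join(certs)  # everything outside the delimiters is dropped
print(chain)
```

The resulting `chain` text is what you would paste into the SSL certificate field.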
- Click Test to ensure that the connection works.
Step 5. Use Data Virtualization to merge data from Cloudera Data Platform and IBM Cloud Pak for Data
- From the IBM Cloud Pak for Data main menu, open the Data menu and click Data Virtualization.
Add data sources
The default landing page for Data Virtualization is Data Sources. Click Add data source.
Choose the option to Select existing connection and choose the `cdp-pvc-base` connection that represents the connection we made to Big SQL, which reads from our Hive databases on Cloudera, then click Add.

Repeat the step for the `db2-on-cpd` connection representing the connection we made to Db2.
From the main drop-down menu, click Virtualize, which is listed under the Virtualization option.
From the Table list, select each of the tables we have created and add them to the cart:

- The `SALES` table associated with the `db2-on-cpd` connection (data from IBM Cloud Pak for Data)
- The `PRODUCTS` table associated with the `cdp-pvc-base` connection (data from Cloudera Data Platform)
Click Add to cart.
Once available, click View cart.
Resolve any conflicts, assign your IBM Cloud Pak for Data project, then click Virtualize.
Once complete, click View my virtualized data on the modal dialog.
NOTE: You can also navigate to your data by clicking on the My virtualized data option from the main Data Virtualization menu.
You should now see your new virtualized data tables listed.
Select the `SALES` and `PRODUCTS` tables, then click Join in the table header.
We won’t need all of the columns for our join, so start by de-selecting all of the column names. The columns we are interested in are `PRODUCT_NUMBER`, `PRODUCT`, and `PRODUCT_TYPE` from the `PRODUCTS` table, and `Product number`, `REVENUE`, and `Gross profit` from the `SALES` table.
Connect the primary keys together by clicking the product number row in the `SALES` table, then dragging and dropping the cursor onto the PRODUCT_NUMBER row in the `PRODUCTS` table.

If successful, the table in the right-hand margin should reflect the joining of the keys.
Click Open in SQL editor at the top of the right-hand margin.
In the editor window, you will see the SQL statement that was generated from the actions in the previous step. We will need to modify this slightly so we get total `SUM` values for revenue and gross profit. The `GROUP BY` clause allows us to generate the sums, which we can use to create some meaningful Cognos Analytics dashboards (see next steps). Note that we are also changing the view name.
CREATE VIEW HIGHEST_REVENUE_PRODUCTS AS
SELECT
  "HIGHREV"."PRODUCTS"."PRODUCT_NUMBER" AS "HIGHREV_PRODUCTS_PRODUCT_NUMBER",
  "HIGHREV"."PRODUCTS"."PRODUCT" AS "HIGHREV_PRODUCTS_PRODUCT",
  "HIGHREV"."PRODUCTS"."PRODUCT_TYPE" AS "HIGHREV_PRODUCTS_PRODUCT_TYPE",
  SUM("HIGHREV"."SALES"."REVENUE") AS "HIGHREV_SALES_REVENUE",
  SUM("HIGHREV"."SALES"."Gross profit") AS "HIGHREV_SALES_Gross profit"
FROM "HIGHREV"."PRODUCTS", "HIGHREV"."SALES"
WHERE "HIGHREV"."PRODUCTS"."PRODUCT_NUMBER" = "HIGHREV"."SALES"."Product number"
GROUP BY
  "HIGHREV"."PRODUCTS"."PRODUCT_NUMBER",
  "HIGHREV"."PRODUCTS"."PRODUCT",
  "HIGHREV"."PRODUCTS"."PRODUCT_TYPE"
Click Run all to execute the SQL.
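The join-and-aggregate logic of the view can be sketched locally with SQLite, using simplified table and column names and made-up rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE products (product_number INTEGER, product TEXT, product_type TEXT);
CREATE TABLE sales (product_number INTEGER, revenue REAL, gross_profit REAL);
INSERT INTO products VALUES (1, 'Trail Tent', 'Tents'), (2, 'Day Pack', 'Packs');
INSERT INTO sales VALUES (1, 100.0, 40.0), (1, 150.0, 60.0), (2, 80.0, 30.0);
""")

# Summing per product means grouping only on the product attributes,
# not on the measures being summed.
rows = con.execute("""
SELECT p.product_number, p.product, p.product_type,
       SUM(s.revenue) AS total_revenue,
       SUM(s.gross_profit) AS total_gross_profit
FROM products p
JOIN sales s ON s.product_number = p.product_number
GROUP BY p.product_number, p.product, p.product_type
ORDER BY total_revenue DESC
""").fetchall()

for row in rows:
    print(row)
```

With the sample rows above, product 1 aggregates two sales records into a single row with total revenue 250.0 and total gross profit 100.0, which is the shape of data the Cognos Analytics dashboards consume.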
To see our new view, navigate back to the My virtualized data panel. You should see the new view.
Use the action menu for the view to select the Preview option.
Step 6. Build Cognos Analytics dashboard to visualize merged data
To use Cognos Analytics, we will need to connect to Big SQL to access the Hive data on our Cloudera cluster. Grab the port number for our Big SQL instance from the Red Hat OpenShift console.
Launch Cognos Analytics instance
From the IBM Cloud Pak for Data Instances panel, locate the Cognos Analytics service, then click Open from the action menu.
This will bring up the Cognos Analytics service main page.
Add Big SQL connection
To navigate to the Big SQL connection panel, click the Manage option from the main menu; then from the Data server connections panel, click the + button to create a new connection; and from the Type list, select IBM Big SQL.
Update the fields to match your Big SQL service connection (including port number), and click Test to try it out.
Note that for our setup, we have SSL enabled and need to pass in our user ID and password.
- If successful, change the name of the connection and click Save.
Once you have your connections and data sources configured, create a new dashboard using the provided visualization tools available in Cognos. Following is an example dashboard (see earlier video for details) that shows the highest revenue product types, and the top 20 products based on revenue.
Summary and next steps
This tutorial covered examples of how to use platform connections in IBM Cloud Pak for Data to access data sources on IBM Cloud Pak for Data and other platforms, including Cloudera Data Platform. It details the steps required to create a Db2 connection to data from IBM Cloud Pak for Data, and a Big SQL connection to access Hive data on Cloudera. It then shows how the data can be combined and transformed using Data Virtualization. It also provides examples of how to import this new virtualized data into Cognos Analytics, where dashboards can be built to pull out insights using multiple visualizations.
You can learn more about remote connectors on IBM Cloud Pak for Data in a tutorial titled Improve performance for your data virtualization data sources with remote connectors. And check out IBM Cloud Pak for Data to learn even more.