
Merge data from Db2 and Hive with Data Virtualization and Big SQL

This tutorial highlights the ability of IBM Cloud Pak for Data to integrate with other market-leading data management platforms, such as Cloudera Data Platform. Due to the complexity of installing Cloudera Data Platform and IBM Cloud Pak for Data, it is assumed you already have access to these environments. If not, you can still follow along and get insights into how integration can be accomplished.

View the following steps as a video or follow along step by step.

NOTE: Database table and schema names may differ between the video and the instructions listed below, but the steps performed should be similar.

Learning objectives

Learn to use IBM Cloud Pak for Data to integrate with other data management platforms, such as Cloudera Data Platform.

Prerequisites

  • IBM Cloud Pak for Data
  • Cloudera Data Platform

Estimated time

Completing this tutorial should take about 45 minutes.

Overview of technologies

Let’s start with an overview of the technologies that we will be using.

Cloudera Data Platform

Cloudera Data Platform (CDP Private Cloud) is built for hybrid cloud, connecting on-premises environments to public clouds. It disaggregates compute and storage so that each can scale independently. Analytics run on containerized compute nodes against a scalable object store and a secure data lake.

  • Cloudera Manager provides administration of your Cloudera cluster, including operations for installation, upgrading, and host management and monitoring.
  • Apache Hive runs over the Hadoop framework and provides an SQL-like interface for processing and querying the HDFS data.
  • Apache Impala is similar to Hive, but its low latency and high concurrency make it a better option for interactive computing.
  • Apache Knox is an application gateway for interacting with the REST APIs and UIs of Hadoop deployments. Knox presents consumers with one endpoint for access to all the required services across multiple Hadoop clusters.

IBM Cloud Pak for Data

IBM Cloud Pak for Data is a unified, pre-integrated data and AI platform that runs natively on the Red Hat OpenShift Container Platform. Services are delivered with an open and extensible cloud native platform for collecting, organizing, and analyzing data. It’s a single interface to perform end-to-end analytics with built-in governance. It also supports and governs the end-to-end AI workflow.

  • IBM Db2 is a relational database management system (RDBMS). Along with providing the Db2 relational database, it includes a family of tools that allows you to manage both structured and unstructured data across on-premises and multi-cloud environments.
  • IBM Big SQL provides a single database connection to query data across Hadoop and other relational/NoSQL databases. Data can reside on local systems or on the cloud.
  • Data Virtualization can query data across many systems without having to copy and replicate data. It’s accurate because you’re querying the latest data at its source.
  • IBM Cognos Analytics Dashboard offers self-service analytics, infused with AI and machine learning, enabling you to create stunning visualizations and share your findings through dashboards and reports.

Architecture overview

Note: We have tested CDP versions 7.1.6 and 7.1.7 and IBM Cloud Pak for Data 3.5.2 and 4.0.1.

CDP configuration

We provisioned eight VMs running CentOS 7.9 on IBM Cloud to host our CDP instance.

We also provisioned two additional virtual servers: one for a bastion node so we could easily SSH into all the other nodes, and the other for an Active Directory server to manage authentication for our cluster.

Flow

The following image shows the nodes listed in our IBM Cloud devices list.

Nodes

  • cid-bastion is our bastion
  • cid-adc is our Active Directory server
  • cid-vm-01 to cid-vm-03 are our master nodes
  • cid-vm-04 to cid-vm-06 are our worker nodes
  • cid-vm-07 and cid-vm-08 are our edge nodes

Windows 2019 Server

We set up a Windows 2019 Server (cid-adc) to take advantage of its Active Directory service for user management and its domain services to manage DNS. We created a private domain (cdplab.local) so most of the traffic would stay on private networks.

Cloudera Manager

We manage our CDP using Cloudera Manager running on cid-vm-01. Below, you see our cdplab cluster and list of hosts.

Cluster and hosts

The services available on the cluster are shown below.

Services

IBM Cloud Pak for Data configuration

IBM Cloud Pak for Data comes pre-packaged with a host of tools and services that can be instantiated. We provisioned Data Virtualization, Db2, Big SQL, and Cognos Analytics.

IBM Cloud Pak for Data

Steps

Step 1. Load data on IBM Cloud Pak for Data

In IBM Cloud Pak for Data, we will add data to the Db2 instance. The Sales.csv data is meant to represent transactions and contains profit, revenue, sales price, and more.

  1. The Db2 service has been provisioned in our IBM Cloud Pak for Data instance. To view it, from the main IBM Cloud Pak for Data menu, click Services > Instances. View service

  2. Click on the Db2 instance name to bring up the details panel. Details panel

  3. Click Open database in the top-right corner, then from the Db2 instance Summary panel, click Load Data. Load Data

  4. From the File selection window, upload the Sales.csv file, then click Next. Upload

  5. To create a new sales table, first create a new GREAT_OUTDOORS schema, click New table +, enter SALES as the new table name, and click Create. New table

  6. Once the table is created, click Next. Table created

  7. You can then view all of the data rows that will be uploaded into our new SALES table. Note that you can toggle the Header in first row button to view the column header names. View data

  8. Click Next to start loading the data. If successful, there should be no error messages.

  9. To view the data, from the Db2 summary panel, click Explore > Tables.
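
If you prefer SQL to the Explore view, a quick query from the Db2 SQL editor also confirms the load. This is a minimal sketch; the GREAT_OUTDOORS schema and SALES table names come from the steps above:

    -- Count the rows loaded into the new table
    SELECT COUNT(*) AS ROW_COUNT
    FROM GREAT_OUTDOORS.SALES;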

Step 2. Create a JDBC connection to Db2

After loading the data, we need to create a new connection to access it. Note that this task requires you to log in as admin on your IBM Cloud Pak for Data console.

  1. From the main menu, select Data > Platform connections. Create connection

  2. From the Platform connections panel, click New connection. New connection

  3. From the New connection panel, in the IBM list of connection options, select Db2. Connection types

  4. To fill in the data for the new connection, take the values from the following two panels:

    • Panel 1: The Db2 instance panel, which you can navigate to by selecting Services > Instances from the main menu on the IBM Cloud Pak for Data main screen, then clicking on the Db2 instance name. Instance details
    • Panel 2: From the previous details screen, click Open database, then select Connection Information from the Summary drop-down menu. Connection info
  5. Transfer the data from these panels into the new connection panel and click Test to ensure that it works. Connection test

    • For Name, enter a unique name for the connection.
    • Database name can be found in panel 1.
    • Hostname can be found in panel 2.
    • Port can be parsed from the JDBC Connection URL (SSL) found in panel 1 (see the URL sketch after this list).
    • Username and password are your IBM Cloud Pak for Data credentials.
    • The SSL certificate can be downloaded from panel 1.
  6. Click Create to create the connection.
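
For reference, the JDBC Connection URL (SSL) in panel 1 follows the standard Db2 URL pattern, so the port is the number that follows the host name. The sketch below uses placeholders only, not real connection details:

    jdbc:db2://<hostname>:<port>/<database-name>:sslConnection=true;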

Step 3. Load data on Cloudera Data Platform

On the Cloudera side, we will use HDFS to populate our Hive database. This step uses the data set Products.csv, which contains information about products (brand, description, price, ID, etc.).

  1. The first step is to add the data to a user’s HDFS directory. Download Products.csv to a host that has HDFS access, then run the command below (replace the last value with a directory you have access to):

    hdfs dfs -put Products.csv /user/<your-user-name>
    
  2. Next, we connect to Hive using the Beeline CLI tool. To do this, download the Hive on Tez configuration file by selecting the Download Client Configuration option listed under the Actions drop-down menu for our Cloudera Hive on Tez service. Client config

  3. Unzip the downloaded file and open the beeline-site.xml file to find the URL associated with the property beeline-hs2-jdbc-url-hive_on_tez. Copy this value and exit the file.

  4. Log into one of the nodes in your Cloudera Data Platform cluster and start the Beeline CLI by typing beeline.

NOTE: The authentication process will depend on how you configured your CDP Private Cloud instance.

  5. Connect to the Hive service by typing the command below, replacing the jdbc-url value with the value copied from the previous step:

    !connect <jdbc-url>
    
  6. Use the following SQL commands to create a new database and switch context to it:

    CREATE DATABASE great_outdoors;
    USE great_outdoors;
    
  7. Use the following SQL command to create a new table for the products:

    CREATE TABLE products (
    `Product_number` INT,
    `Product_line` STRING,
    `Product_type` STRING,
    `Product` STRING,
    `Introduction_date` STRING,
    `Product_brand` STRING,
    `Product_color` STRING,
    `Product_size` STRING,
    `Product_description` STRING,
    `Unit_cost` STRING,
    `Unit_price` STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE TBLPROPERTIES('transactional'='false');
    
  8. Use the following SQL command to load the data from HDFS to the table:

    LOAD DATA INPATH '/<user-dir>/Products.csv' INTO TABLE products;
    
  9. If the data is loaded correctly, you can query the data by running the SQL command below:

    SELECT * FROM products;
    
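On a large table, SELECT * can take a while in Beeline. A bounded query is a handy sanity check; this sketch uses the products table created above:

    -- Spot-check a few rows without scanning the whole table
    SELECT * FROM products LIMIT 10;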

Step 4. Use IBM Big SQL to synchronize data from the Hive table into IBM Cloud Pak for Data

At this point we’ll use Big SQL to synchronize data from Hive to Db2, so make sure you have the service installed on your IBM Cloud Pak for Data cluster.

Big SQL instance

  1. Open the Big SQL service by clicking Open in the Action (⋮) menu.

  2. From the Summary drop-down, click the Run SQL option.

  3. Click Create new +, enter the following SQL command, and click Run all:

    CALL SYSHADOOP.HCAT_SYNC_OBJECTS('GREAT_OUTDOORS', 'PRODUCTS', 't');
    
  4. From the drop-down menu, click Explore > Hadoop tables. Verify that the product data shows up when going to the GREAT_OUTDOORS schema and viewing the PRODUCTS table.
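
As an alternative to browsing the Hadoop tables view, you can verify the synchronization directly in the Run SQL editor. This is a minimal sketch against the schema and table synchronized above:

    -- Confirm that Big SQL can read the Hive-backed table
    SELECT COUNT(*) AS ROW_COUNT
    FROM GREAT_OUTDOORS.PRODUCTS;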

Create a JDBC connection to Big SQL

  1. From the main menu, select Data > Platform connections. Platform connections

  2. From the Platform connections panel, click New connection. New connection

  3. From the New connection panel, in the IBM list of connection options, select Db2 Big SQL. The following image shows an example Big SQL connection that you can model your own connection on. Connection details

NOTE: To determine the host name, follow the procedures outlined in the Setting up a connection to Db2 Big SQL instructions. We used the “Passthrough secure route method.” The result was a route we could use for the Hostname or IP Address field.

  4. For Username and Password, use your Cloudera Data Platform credentials.

  5. For SSL certificate, you must run an OpenSSL command against the host name you just generated:

    openssl s_client -showcerts -servername <hostname without port> -connect <hostname:port>
    

This returns a certificate chain, so you will need to copy every certificate in the chain and paste them all into the SSL certificate field. Be sure to remove any extraneous lines outside of the BEGIN and END certificate delimiters.
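
Each certificate in the chain is a PEM block like the sketch below; paste the blocks back-to-back. The ellipsis stands in for the base64-encoded certificate body:

    -----BEGIN CERTIFICATE-----
    ...base64-encoded certificate body...
    -----END CERTIFICATE-----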

  6. Click Test to ensure that the connection works.

Step 5. Use Data Virtualization to merge data from Cloudera Data Platform and IBM Cloud Pak for Data

  1. From the IBM Cloud Pak for Data main menu, open the Data menu and click Data Virtualization. Start merge

Add data sources

  1. The default landing page for Data Virtualization is Data Sources. Click Add data source. Data sources

  2. Choose the Select existing connection option, then pick the cdp-pvc-base connection that represents the connection we made to Big SQL (which reads from our Hive databases on Cloudera), and click Add. Add connection

  3. Repeat the step for the db2-on-cpd connection representing the connection we made to Db2.

Virtualize tables

  1. From the main drop-down menu, click Virtualize, which is listed under the Virtualization option. Virtualize tables

  2. From the Table list, select each of the tables we have created and add them to the cart:

    • The SALES table associated with the db2-on-cpd connection (data from IBM Cloud Pak for Data): SALES table
    • The PRODUCTS table associated with the cdp-pvc-base connection (data from Cloudera Data Platform): Products table
  3. Click Add to cart.

  4. Once available, click View cart. Cart

  5. Resolve any conflicts, assign your IBM Cloud Pak for Data project, then click Virtualize.

  6. Once complete, click View my virtualized data on the modal dialog.

NOTE: You can also navigate to your data by clicking on the My virtualized data option from the main Data Virtualization menu.

View data option

You should now see your new virtualized data tables listed.

View tables

Join tables

  1. Select the SALES and PRODUCTS tables, then click Join in the table header.

  2. We won’t need all of the columns for our join, so start by de-selecting all of the column names. The columns we are interested in are:

    • SALES table: Product number, REVENUE, Gross profit
    • PRODUCTS table: PRODUCT_NUMBER, PRODUCT, PRODUCT_TYPE Select columns
  3. Connect the primary keys by clicking the Product number row in the SALES table, then dragging and dropping the cursor onto the PRODUCT_NUMBER row in the PRODUCTS table. Connect keys

If successful, the table in the right-hand margin should reflect the joining of the keys.

  4. Click Open in SQL editor at the top of the right-hand margin. Connect keys

  5. In the editor window, you will see the SQL statement that was generated from the actions in the previous steps. We will need to modify it slightly to get total SUM values for revenue and gross profit: grouping by the product columns produces one total per product, which we can use to create some meaningful Cognos Analytics dashboards (see next steps). Note that we are also changing the view name.

    CREATE VIEW HIGHEST_REVENUE_PRODUCTS
    AS
    SELECT
     "HIGHREV"."PRODUCTS"."PRODUCT_NUMBER" AS "HIGHREV_PRODUCTS_PRODUCT_NUMBER",
     "HIGHREV"."PRODUCTS"."PRODUCT" AS "HIGHREV_PRODUCTS_PRODUCT",
     "HIGHREV"."PRODUCTS"."PRODUCT_TYPE" AS "HIGHREV_PRODUCTS_PRODUCT_TYPE",
     SUM("HIGHREV"."SALES"."REVENUE") AS "HIGHREV_SALES_REVENUE",
     SUM("HIGHREV"."SALES"."Gross profit") AS "HIGHREV_SALES_Gross profit"
    FROM
     "HIGHREV"."PRODUCTS",
     "HIGHREV"."SALES"
    WHERE
     "HIGHREV"."PRODUCTS"."PRODUCT_NUMBER" = "HIGHREV"."SALES"."Product number"
    GROUP BY
     "HIGHREV"."PRODUCTS"."PRODUCT_NUMBER",
     "HIGHREV"."PRODUCTS"."PRODUCT",
     "HIGHREV"."PRODUCTS"."PRODUCT_TYPE"
    
  6. Click Run all to execute the SQL. Run all

New view

To see our new view, navigate back to the My virtualized data panel. You should see the new view.

New view

Use the action menu for the view to select the Preview option.
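
Before moving on to Cognos, you can also spot-check the view from the Data Virtualization SQL editor. This is a sketch; the view name and the revenue alias come from the CREATE VIEW statement above, and you may need to qualify the view with your virtualization schema:

    -- Preview the top 10 products by total revenue
    SELECT *
    FROM HIGHEST_REVENUE_PRODUCTS
    ORDER BY "HIGHREV_SALES_REVENUE" DESC
    FETCH FIRST 10 ROWS ONLY;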

Step 6. Build Cognos Analytics dashboard to visualize merged data

To use Cognos Analytics, we will need to connect to Big SQL to access the Hive data on our Cloudera cluster. Grab the port number for our Big SQL instance from the Red Hat OpenShift console.

Launch Cognos Analytics instance

From the IBM Cloud Pak for Data Instances panel, locate the Cognos Analytics service, then click Open from the action menu. Open Cognos

This will bring up the Cognos Analytics service main page.

Cognos home

Add Big SQL connection

  1. To navigate to the Big SQL connection panel, click the Manage option from the main menu; then from the Data server connections panel, click the + button to create a new connection; and from the Type list, select IBM Big SQL. Config connection

  2. Update the fields to match your Big SQL service connection (including port number), and click Test to try it out.

Note that for our setup, we have SSL enabled and need to pass in our user ID and password.

Connection parameters

  3. If successful, change the name of the connection and click Save.
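
If your Cognos connection editor asks for a JDBC URL rather than individual fields, Big SQL accepts the standard Db2 URL form. The sketch below uses placeholders, and the BIGSQL database name is an assumption; substitute whatever your instance reports:

    jdbc:db2://<bigsql-hostname>:<port>/BIGSQL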

Build dashboard

Once you have your connections and data sources configured, create a new dashboard using the visualization tools available in Cognos. Following is an example dashboard (see the earlier video for details) that shows the highest revenue product types and the top 20 products by revenue.

Cognos final

Summary and next steps

This tutorial covered examples of how to use platform connections in IBM Cloud Pak for Data to access data sources on IBM Cloud Pak for Data and other platforms, including Cloudera Data Platform. It detailed the steps required to create a Db2 connection to data from IBM Cloud Pak for Data, and a Big SQL connection to access Hive data on Cloudera. It then showed how the data can be combined and transformed using Data Virtualization. Finally, it provided examples of how to import this new virtualized data into Cognos Analytics, where dashboards can be built to pull out insights using multiple visualizations.

You can learn more about remote connectors on IBM Cloud Pak for Data in a tutorial titled Improve performance for your data virtualization data sources with remote connectors. And check out IBM Cloud Pak for Data to learn even more.