
IBM Developer Blog


Learn how the two data and AI platforms communicate


This blog post is the first of a three-part series authored by software developers and architects at IBM and Cloudera. This first post focuses on integration points of the recently announced joint offering: Cloudera Data Platform for IBM Cloud Pak for Data. The second post will look at how Cloudera Data Platform was installed on IBM Cloud using Ansible. And the third post will focus on lessons learned from installing, maintaining, and verifying the connectivity of the two platforms. Let’s get started!

In this post, we outline the main integration points between Cloudera Data Platform and IBM Cloud Pak for Data and explain how the two distinct data and AI platforms communicate with each other. Integrating the two platforms is straightforward thanks to capabilities available out of the box in both products; establishing a connection between the two is just a few clicks away.

Architecture diagram showing Cloudera Data Platform for Cloud Pak for Data

In our view, there are three key points to integrating Cloudera Data Platform and IBM Cloud Pak for Data; all other services piggyback on one of these:

  • Apache Knox Gateway
  • Execution Engine for Apache Hadoop
  • Db2 Big SQL

Read on for more information about how each integration point works. For a demonstration of how to use data from Hive and Db2, check out the video below, where we join the data using Data Virtualization and then display it with IBM Cognos Analytics.

Apache Knox Gateway

To truly be secure, a Hadoop cluster needs Kerberos. However, Kerberos requires a client-side library and complex client-side configuration. This is where the Apache Knox Gateway (“Knox”) comes in. By encapsulating Kerberos, Knox eliminates the need for client software or client configuration and, thus, simplifies the access model. Knox integrates with identity management and SSO systems, such as Active Directory and LDAP, to allow identities from these systems to be used for access to Cloudera clusters.

Knox dashboard showing the list of supported services

Cloudera services such as Impala, Hive, and HDFS can be configured with Knox, allowing JDBC connections to easily be created in IBM Cloud Pak for Data.

Creating a JDBC connection to Impala via Knox

List of connections on IBM Cloud Pak for Data
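To make the connection details concrete, here is a minimal sketch of how a Hive JDBC URL routed through Knox is typically assembled. The hostname, port, and topology name below are illustrative placeholders; use the values from your own Knox deployment.

```python
# Sketch: building a Hive JDBC URL that routes through the Knox gateway.
# Knox exposes Hive over HTTP transport, so the URL enables SSL and points
# httpPath at the gateway topology. All values here are placeholders.
def knox_hive_jdbc_url(knox_host, topology="cdp-proxy-api", port=8443):
    """Return a Hive JDBC URL using Knox's HTTP transport mode."""
    return (
        f"jdbc:hive2://{knox_host}:{port}/;"
        f"ssl=true;transportMode=http;"
        f"httpPath=gateway/{topology}/hive"
    )

print(knox_hive_jdbc_url("knox.example.com"))
# jdbc:hive2://knox.example.com:8443/;ssl=true;transportMode=http;httpPath=gateway/cdp-proxy-api/hive
```

A URL in this form (plus a username and password from your identity provider) is all IBM Cloud Pak for Data needs to create the JDBC connection, since Knox hides the Kerberos handshake behind the gateway.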

Execution Engine for Apache Hadoop

The Execution Engine for Apache Hadoop service is installed on both IBM Cloud Pak for Data and on the worker nodes of a Cloudera Data Platform deployment. Execution Engine for Hadoop allows users to:

  • Browse remote Hadoop data (HDFS, Impala, or Hive) through platform-level connections
  • Cleanse and shape remote Hadoop data (HDFS, Impala, or Hive) with Data Refinery
  • Run a Jupyter notebook session on the remote Hadoop system
  • Access Hadoop systems with basic utilities from RStudio and Jupyter notebooks

After installing and configuring the services on IBM Cloud Pak for Data and Cloudera Data Platform, you can create platform-level connections to HDFS, Impala, and Hive.

Execution Engine for Hadoop connection options

Once a connection has been established, data from HDFS, Impala, or Hive can be browsed and imported.

Browsing through an HDFS connection made via Execution Engine for Hadoop
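Browsing HDFS in this way maps onto the standard WebHDFS REST API. As a sketch, a directory listing is a `LISTSTATUS` call such as `GET https://<knox-host>:8443/gateway/<topology>/webhdfs/v1/<path>?op=LISTSTATUS`; the JSON below is an illustrative response trimmed to a few fields, not output from a real cluster.

```python
import json

# Sketch: parsing a WebHDFS LISTSTATUS response like the one returned when
# browsing an HDFS connection. The payload below is illustrative sample data.
sample_response = json.loads("""
{"FileStatuses": {"FileStatus": [
  {"pathSuffix": "warehouse", "type": "DIRECTORY", "length": 0},
  {"pathSuffix": "sales.csv", "type": "FILE", "length": 10240}
]}}
""")

def list_entries(liststatus_json):
    """Return (name, type) pairs from a WebHDFS LISTSTATUS response."""
    return [(s["pathSuffix"], s["type"])
            for s in liststatus_json["FileStatuses"]["FileStatus"]]

print(list_entries(sample_response))
# [('warehouse', 'DIRECTORY'), ('sales.csv', 'FILE')]
```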

Data residing in HDFS, Impala, or Hive can be cleaned and modified through Data Refinery on IBM Cloud Pak for Data.

Data Refinery allows for operations to be run on data

Execution Engine for Hadoop also allows Jupyter notebook sessions to connect to a remote Hadoop system.

Jupyter notebook connecting to a remote HDFS

Db2 Big SQL

The Db2 Big SQL service is installed on IBM Cloud Pak for Data and is configured to communicate with a Cloudera Data Platform deployment. Db2 Big SQL allows users to:

  • Query data stored on Hadoop services such as HDFS and Hive
  • Query large amounts of data residing in a secured (Kerberized) or unsecured Hadoop-based platform

Once Big SQL is configured, you choose which data to synchronize into tables. After the data is in a table, you can save it to a project, run queries against it, or browse it. Ranger, a Cloudera service that grants or denies access to resources, is required for use with Big SQL.

Synchronizing data from Hive to a Db2 table in Big SQL

Previewing synchronized data from Hive
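The Hive-to-Big SQL synchronization shown above can also be driven from SQL via Big SQL's `SYSHADOOP.HCAT_SYNC_OBJECTS` stored procedure. The sketch below just assembles the `CALL` statement; the schema and table names are placeholders, and you would run the resulting statement through any Db2 client connected to Big SQL.

```python
# Sketch: SYSHADOOP.HCAT_SYNC_OBJECTS imports Hive table definitions into
# Big SQL. Schema/table names below are placeholders.
def hcat_sync_call(schema, table, mode="a", exists="REPLACE", errors="CONTINUE"):
    """Build a CALL statement that syncs Hive metadata into Big SQL.

    mode:   'a' = all object types
    exists: action when the Big SQL object already exists (e.g. REPLACE)
    errors: whether to CONTINUE past per-object errors or stop
    """
    return (f"CALL SYSHADOOP.HCAT_SYNC_OBJECTS("
            f"'{schema}', '{table}', '{mode}', '{exists}', '{errors}')")

print(hcat_sync_call("SALESDB", "TRANSACTIONS"))
# CALL SYSHADOOP.HCAT_SYNC_OBJECTS('SALESDB', 'TRANSACTIONS', 'a', 'REPLACE', 'CONTINUE')
```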

Another benefit of configuring Db2 Big SQL to interact with your Cloudera cluster is that a JDBC connection is created that can be leveraged by many other IBM Cloud Pak for Data services, such as Data Virtualization, Cognos Analytics, and Watson Knowledge Catalog.

JDBC connection information for an instance of Big SQL

The Big SQL JDBC connection being consumed by Cognos Analytics

The Big SQL JDBC connection being consumed by DataStage
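Because Big SQL speaks the Db2 wire protocol, the same connection details can also be reused outside these services from any Db2 client. The sketch below builds an `ibm_db`-style DSN string; the hostname, credentials, and port are placeholders (32051 is a commonly used Big SQL default, but check your own instance).

```python
# Sketch: Big SQL is Db2-based, so a standard Db2 DSN string is enough to
# connect with the ibm_db driver. All values below are placeholders.
def bigsql_dsn(host, user, password, port=32051, database="BIGSQL"):
    """Build an ibm_db-style DSN string for a Big SQL instance."""
    return (f"DATABASE={database};HOSTNAME={host};PORT={port};"
            f"PROTOCOL=TCPIP;UID={user};PWD={password};")

dsn = bigsql_dsn("bigsql.example.com", "analyst", "secret")
print(dsn)
# Connecting would then look like:
#   import ibm_db
#   conn = ibm_db.connect(dsn, "", "")
```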

Summary and next steps

We hope you learned more about how to integrate IBM Cloud Pak for Data and Cloudera Data Platform. Learn more about Cloudera Data Platform for IBM Cloud Pak for Data by checking out the product page, or visit the IBM Hybrid Data Management Community to post questions and talk to our experts.

If you enjoyed this, check out the video below where Omkar Nimbalkar and Nadeem Asghar discuss the IBM and Cloudera partnership.