A technical deep-dive on integrating Cloudera Data Platform and IBM Cloud Pak for Data
Learn how the two data and AI platforms communicate
This blog post is the first of a three-part series authored by software developers and architects at IBM and Cloudera. This first post focuses on integration points of the recently announced joint offering: Cloudera Data Platform for IBM Cloud Pak for Data. The second post will look at how Cloudera Data Platform was installed on IBM Cloud using Ansible. And the third post will focus on lessons learned from installing, maintaining, and verifying the connectivity of the two platforms. Let’s get started!
In this post, we outline the main integration points between Cloudera Data Platform and IBM Cloud Pak for Data and explain how the two distinct data and AI platforms communicate with each other. Integrating the two platforms is made easy by capabilities available out of the box on both IBM Cloud Pak for Data and Cloudera Data Platform; establishing a connection between them is just a few clicks away.
Architecture diagram showing Cloudera Data Platform for Cloud Pak for Data
In our view, there are three key points to integrating Cloudera Data Platform and IBM Cloud Pak for Data; all other services piggyback on one of these:
- Apache Knox Gateway (available on Cloudera)
- Execution Engine for Apache Hadoop (available on IBM Cloud Pak for Data)
- Db2 Big SQL (available on IBM Cloud Pak for Data)
Read on for more information about how each integration point works. For a demonstration of how to use data from Hive and Db2, check out the video below, where we join the data using Data Virtualization and then display it with IBM Cognos Analytics.
Apache Knox Gateway
To truly be secure, a Hadoop cluster needs Kerberos. However, Kerberos requires a client-side library and complex client-side configuration. This is where the Apache Knox Gateway (“Knox”) comes in. By encapsulating Kerberos, Knox eliminates the need for client software or client configuration and, thus, simplifies the access model. Knox integrates with identity management and SSO systems, such as Active Directory and LDAP, to allow identities from these systems to be used for access to Cloudera clusters.
Knox dashboard showing the list of supported services
Creating a JDBC connection to Impala via Knox
List of connections on IBM Cloud Pak for Data
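To make the connection shown above concrete, here is a minimal sketch of how a JDBC URL for Impala via Knox is typically assembled. The topology name (`cdp-proxy-api`), port, and `AuthMech` value are illustrative assumptions based on common Cloudera Impala JDBC driver settings; check your Knox deployment for the actual values:

```python
def impala_jdbc_url_via_knox(knox_host: str, topology: str = "cdp-proxy-api",
                             port: int = 8443) -> str:
    """Build a JDBC URL that reaches Impala through the Knox gateway.

    Knox proxies Impala's HTTP transport over TLS, so the URL uses
    transportMode=http with an httpPath under the Knox topology.
    The topology name and AuthMech=3 (username/password) below are
    illustrative assumptions -- consult your Knox configuration.
    """
    return (
        f"jdbc:impala://{knox_host}:{port}/;ssl=1;transportMode=http;"
        f"httpPath=gateway/{topology}/impala;AuthMech=3"
    )

print(impala_jdbc_url_via_knox("knox.example.com"))
```

Because Knox terminates the connection and handles Kerberos behind the gateway, the client only needs this one URL plus a username and password.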
Execution Engine for Apache Hadoop
The Execution Engine for Apache Hadoop service is installed on both IBM Cloud Pak for Data and on the worker nodes of a Cloudera Data Platform deployment. Execution Engine for Hadoop allows users to:
- Browse remote Hadoop data (HDFS, Impala, or Hive) through platform-level connections
- Cleanse and shape remote Hadoop data (HDFS, Impala, or Hive) with Data Refinery
- Run a Jupyter notebook session on the remote Hadoop system
- Access Hadoop systems with basic utilities from RStudio and Jupyter notebooks
After installing and configuring the services on IBM Cloud Pak for Data and Cloudera Data Platform, you can create platform-level connections to HDFS, Impala, and Hive.
Execution Engine for Hadoop connection options
Once a connection has been established, data from HDFS, Impala, or Hive can be browsed and imported.
Browsing through an HDFS connection made via Execution Engine for Hadoop
Data residing in HDFS, Impala, or Hive can be cleansed and shaped through Data Refinery on IBM Cloud Pak for Data.
Data Refinery allows for operations to be run on data
Execution Engine for Hadoop also allows Jupyter notebook sessions to connect to a remote Hadoop system.
Jupyter notebook connecting to a remote HDFS
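As a sketch of what such a notebook session might do, the snippet below lists an HDFS directory over the WebHDFS REST API, proxied through Knox. The gateway port, topology name, and basic-auth handling are assumptions for illustration; in practice, Execution Engine for Hadoop configures the real endpoints for you:

```python
import base64
import json
import urllib.request

KNOX_PORT = 8443  # common Knox gateway port; verify for your deployment

def webhdfs_url(knox_host: str, path: str,
                topology: str = "cdp-proxy-api", op: str = "LISTSTATUS") -> str:
    """URL for a WebHDFS call proxied through Knox.

    The topology name is an illustrative assumption -- check your
    gateway's configuration for the actual value.
    """
    return (f"https://{knox_host}:{KNOX_PORT}/gateway/{topology}"
            f"/webhdfs/v1{path}?op={op}")

def list_hdfs_dir(knox_host: str, path: str, user: str, password: str) -> list:
    """List the contents of an HDFS directory via Knox.

    Basic auth suffices on the client side because Knox performs the
    Kerberos handshake with the cluster on our behalf.
    """
    request = urllib.request.Request(webhdfs_url(knox_host, path))
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request) as resp:
        statuses = json.load(resp)["FileStatuses"]["FileStatus"]
    return [entry["pathSuffix"] for entry in statuses]
```

Note that no Kerberos client library appears anywhere in the snippet, which is exactly the simplification Knox provides.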
Db2 Big SQL
The Db2 Big SQL service is installed on IBM Cloud Pak for Data and is configured to communicate with a Cloudera Data Platform deployment. Db2 Big SQL allows users to:
- Query data stored on Hadoop services such as HDFS and Hive
- Query large amounts of data residing in a secured (Kerberized) or unsecured Hadoop-based platform
Once Big SQL is configured, you can choose which data to synchronize into tables. After the data is in a table, you can save it to a project, run queries against it, or browse it. Apache Ranger, the Cloudera service that manages authorization by allowing or denying access to data, is required for use with Big SQL.
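As a rough sketch of that flow, the snippet below synchronizes a Hive table into the Big SQL catalog with the `SYSHADOOP.HCAT_SYNC_OBJECTS` stored procedure and then queries it through the Python `ibm_db` driver. The schema and table names (`sales.orders`), port, and DSN options are hypothetical placeholders; consult your Big SQL instance details for real values:

```python
def bigsql_dsn(host: str, port: int, user: str, password: str,
               database: str = "BIGSQL") -> str:
    """Db2 keyword-style DSN for a Big SQL head node (values are examples)."""
    return (f"DATABASE={database};HOSTNAME={host};PORT={port};"
            f"PROTOCOL=TCPIP;UID={user};PWD={password};")

# Pull Hive's definition of a hypothetical sales.orders table into the
# Big SQL catalog, then query it with ordinary SQL.
SYNC_SQL = ("CALL SYSHADOOP.HCAT_SYNC_OBJECTS"
            "('sales', 'orders', 'a', 'REPLACE', 'CONTINUE')")
QUERY_SQL = "SELECT order_id, amount FROM sales.orders FETCH FIRST 10 ROWS ONLY"

def run(host: str, port: int, user: str, password: str) -> None:
    import ibm_db  # deferred so the sketch is importable without the driver
    conn = ibm_db.connect(bigsql_dsn(host, port, user, password), "", "")
    ibm_db.exec_immediate(conn, SYNC_SQL)
    stmt = ibm_db.exec_immediate(conn, QUERY_SQL)
    row = ibm_db.fetch_assoc(stmt)
    while row:
        print(row)
        row = ibm_db.fetch_assoc(stmt)
    ibm_db.close(conn)
```

Once synchronized, the table behaves like any other Db2 table, which is what lets the rest of the IBM Cloud Pak for Data services consume it over a standard JDBC connection.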
Synchronizing data from Hive to a Db2 table in Big SQL
Previewing synchronized data from Hive
Another benefit of configuring Db2 Big SQL to interact with your Cloudera cluster is that a JDBC connection is created that can be leveraged by many other IBM Cloud Pak for Data services, such as Data Virtualization, Cognos Analytics, and Watson Knowledge Catalog.
JDBC connection information for an instance of Big SQL
The Big SQL JDBC connection being consumed by Cognos Analytics
The Big SQL JDBC connection being consumed by DataStage
Summary and next steps
We hope you learned more about how to integrate IBM Cloud Pak for Data and Cloudera Data Platform. Learn more about Cloudera Data Platform for IBM Cloud Pak for Data by checking out the product page, or visit the IBM Hybrid Data Management Community to post questions and talk to our experts.
If you enjoyed this, check out the video below where Omkar Nimbalkar and Nadeem Asghar discuss the IBM and Cloudera partnership.