Overview

Skill Level: Intermediate

Versions supported by CP4D 2.1

Many companies have collected huge volumes of data in their data lakes, often implemented using Hadoop. Now they want to monetize these assets. This article shows a series of examples of how you can access and analyze data from Hadoop using CP4D.

Ingredients

1) For HDP:

·         Version 2.5.6

·         Version 2.6.2

·         Version 2.6.3

·         Version 2.6.4

 

2) For CDH:

·         Version 5.12

·         Version 5.13

·         Version 5.14

Step-by-step

  1. Set up for integrating a Hadoop cluster

    Integration overview

    In a secure Hadoop cluster, only the services that support authentication are exposed to the outside world. Ports for unsecured services such as native HDFS, WebHDFS, WebHCat, and Livy are hidden behind a firewall. Additionally, the Hadoop data nodes are typically connected to the management nodes via a private network and are therefore not externally visible. Considering this, it is recommended to implement the Hadoop Integration Service package (DSXHI) for securely connecting to the Hadoop services.

    DSXHI sets up a gateway (based on Apache Knox) on the Hadoop edge node to expose the WebHDFS, WebHCat, Livy for Spark, and Livy for Spark2 services to DSX. After authentication, the gateway redirects requests to a service that already exists on the Hadoop platform. Additionally, DSXHI can create its own instance of the Livy for Spark and Livy for Spark2 services in case these have not been implemented as part of the Hadoop installation.

    [Figure: DSXHI gateway architecture. Notebooks in Cloud Pak for Data connect through the Knox-based gateway on the Hadoop edge node to WebHDFS, WebHCat, and Livy for Spark/Spark2.]

    In the above picture, the DSXHI gateway handles requests from the notebook interface and authenticates the user to the Hadoop cluster. The gateway then directs the traffic to the appropriate service (WebHDFS, WebHCat, or Livy for Spark/Spark2) running on the Hadoop cluster.
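    Once DSXHI is installed and a Cloud Pak for Data cluster has been registered (as described in the following sections), you can sanity-check this flow by calling the WebHDFS REST API through the gateway with curl. The host name, gateway path, and credentials below are placeholders, and the example assumes the gateway accepts basic authentication for your user:

    # List the HDFS home directory of user dsxhi through the DSXHI (Knox-based) gateway
    curl -k -u dsxhi:password \
      "https://hadoop-edge.example.com:8443/gateway/<gateway path>/webhdfs/v1/user/dsxhi?op=LISTSTATUS"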

    2. Set up an edge node for the Hadoop integration software

    The Hadoop integration software should be installed on an edge node of your Hadoop cluster (HDP or CDH).

    To set up a healthy edge node, ensure that it meets the following requirements:

    ·         8 GB of memory

    ·         2 virtual CPUs

    ·         100 GB of disk storage, mounted and available in the /var directory

    ·         If you have a multi-tenant environment, a 10 Gb network interface card (NIC). If you don’t plan to use the WebHDFS service extensively, you can use a 1 Gb NIC instead.

    ·         Roles for the edge node have to be installed.

                For HDP:

    a)      The Hadoop client is installed.

    b)      If the HDP cluster has the Spark service, the Spark client is installed.

    c)      If the HDP cluster has the Spark2 service, the Spark2 client is installed.

    d)      If your cluster is Kerberized, copy the SPNEGO keytab file to /etc/security/keytabs/spnego.service.keytab.

     For CDH:

    a)      The Hadoop Gateway Role is installed.

    b)      If the CDH cluster has the Spark service, the Spark Gateway Role is installed.

    c)      If the CDH cluster has the Spark2 service, the Spark2 client is installed.

     

    ·         The node has an available port for the Hadoop integration software. The port should be exposed for access from IBM Cloud Pak for Data clusters that need to connect to the Hadoop cluster.

    ·         The node has an available port for the Hadoop integration software REST service. The port does not need to be exposed for external access.

    ·         The node has an available port for Livy for Spark or Livy for Spark2, depending on the service that needs to be exposed. The port does not need to be exposed for external access.

    ·         If your cluster is Kerberized, have the keytab file for the service user and the spnego keytab file for the edge node. This eliminates the need for each IBM Cloud Pak for Data user to have a valid keytab file.

     

    To check that the required ports are available, the following command should return no results:

    netstat -ano | egrep '8443|8082|8998|8999'

     

    Instead of creating a new edge node, you can also use an existing data node as the edge node.

    3. DNS service name resolution for the Hadoop cluster configured in the Cloud Pak for Data cluster

    On the Cloud Pak for Data cluster, configure DNS name resolution for the Hadoop cluster.

    Then enable the Cloud Pak for Data cluster to use that external DNS service. Without working DNS resolution, you will not be able to push images from Watson Studio Local (WSL) to the Hadoop cluster.
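    A simple way to confirm that name resolution works from inside the Cloud Pak for Data cluster is to ping the Hadoop edge node from a pod, for example the utils-api pod that is also used in the troubleshooting section below. The pod name and edge node host name are examples only:

    # Find the utils-api pod, then check that the edge node host name resolves from inside it
    kubectl get pods -n zen | grep utils-api
    kubectl exec -it utils-api-86568c499b-rjlqk -n zen -- ping -c 1 hadoop-edge.example.com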

     

    Create a service user

    A service user is required for DSXHI; in this example it is named dsxhi.

    The dsxhi user, as well as the users configured in Data Science Experience or Cloud Pak for Data, must be known on the Hadoop cluster and must have a home directory in HDFS. Hadoop requires that a user has the same UID on all nodes in the cluster, hence we specify the UID explicitly when creating the Linux account.

    To create the dsxhi user, execute the following on all Hadoop nodes:

    useradd -u 3001 -g hdfs dsxhi

    Subsequently, create the dsxhi HDFS home directory (as user hdfs):

    hdfs dfs -mkdir /user/dsxhi

    hdfs dfs -chown dsxhi:hdfs /user/dsxhi
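    As an optional check (not part of the official procedure), you can verify that the UID is consistent and that the HDFS home directory has the expected owner:

    id dsxhi                      # should report uid=3001 and group hdfs on every node
    hdfs dfs -ls -d /user/dsxhi   # should show dsxhi:hdfs as owner and group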

    Preparing the Hadoop cluster for DSXHI

    The DSXHI services will connect to the Hadoop cluster with special privileges to impersonate other users, ensuring that YARN jobs are submitted on behalf of the user and that HDFS authorization settings are respected. In the examples below we assume that this user is called dsxhi.

    1. For the HDFS component, the following settings must be applied to core-site.

    1) In HDP, this configuration can be set in HDFS -> Configs -> Advanced -> Custom core-site.

    hadoop.proxyuser.dsxhi.groups=*

    hadoop.proxyuser.dsxhi.hosts=*

    [Screenshot: Custom core-site proxyuser settings in Ambari]

    2) In CDH, this configuration can be set in HDFS -> Configuration -> Advanced -> Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml.

    [Screenshot: Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml in Cloudera Manager]
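    The Safety Valve field expects XML property definitions rather than key=value pairs. A sketch of the snippet for the two proxyuser settings above might look like the following (the same pattern applies to the webhcat-site snippet in the next step):

    <property>
      <name>hadoop.proxyuser.dsxhi.hosts</name>
      <value>*</value>
    </property>
    <property>
      <name>hadoop.proxyuser.dsxhi.groups</name>
      <value>*</value>
    </property>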

    2. If you plan to access Hive tables, you must add dsxhi as a WebHCat proxy user.

    1) In HDP, you can set this configuration in Hive -> Configs -> Advanced -> Custom webhcat-site.

    webhcat.proxyuser.dsxhi.hosts=*

    webhcat.proxyuser.dsxhi.groups=*

    Example:

    [Screenshot: Custom webhcat-site proxyuser settings in Ambari]

    2) In CDH, this configuration can be set in Hive -> Configuration -> Advanced -> WebHCat Server Advanced Configuration Snippet (Safety Valve) for webhcat-site.xml.

    [Screenshot: WebHCat Server Advanced Configuration Snippet (Safety Valve) for webhcat-site.xml in Cloudera Manager]

     

    3. Finally, if you intend to push down Spark jobs to the Hadoop cluster, you must define dsxhi as the super-user for Spark.

    In HDP, you can set this in Spark or Spark2 -> Configs -> Advanced -> Custom livy2-conf.

    livy.superusers=dsxhi

    Example:

    [Screenshot: Custom livy2-conf with livy.superusers in Ambari]

    In CDH, DSXHI can create its own instance of the Livy for Spark and Livy for Spark2 services if these have not been implemented as part of the Hadoop installation.

     

    4. If your cluster is Kerberized, generate dsxhi.keytab using the commands below, and copy the keytab to all other HDP nodes.

     

    kadmin.local -q "addprinc -randkey dsxhi@IBM.COM"

    kadmin.local -q "ktadd -norandkey -k /etc/security/keytabs/dsxhi.keytab dsxhi@IBM.COM"
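    Optionally, verify the generated keytab before distributing it:

    # List the principals stored in the keytab and confirm that it can be used to authenticate
    klist -kt /etc/security/keytabs/dsxhi.keytab
    kinit -kt /etc/security/keytabs/dsxhi.keytab dsxhi@IBM.COM && klist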

     

    When you have set all properties, you must restart the services.

     
    Installing the DSXHI package

    You will need to install the DSXHI package on one of the edge nodes of the Hadoop cluster.

    Copy the Hadoop integration software RPM file, dsxhi_platform.rpm, to the edge node. For example, if your node is running on x86-64 hardware, copy the dsxhi_x86_64.rpm file.

    To install the package, use yum.

    yum install -y dsxhi_x86_64.rpm
     
    Configure DSXHI

    DSXHI includes sample configurations for Hortonworks (HDP) and Cloudera (CDH). Copy the template to dsxhi_install.conf.

    cp /opt/ibm/dsxhi/conf/dsxhi_install.conf.template.HDP /opt/ibm/dsxhi/conf/dsxhi_install.conf

    In the snippet below, only the changed properties are shown. In this example, DSXHI is configured to run the Livy for Spark and Livy for Spark2 services on the edge node.

    dsxhi_license_acceptance=a

    dsxhi_serviceuser=dsxhi

    dsxhi_serviceuser_group=hdfs

     

    # dsxhi_serviceuser_keytab=/etc/security/keytabs/dsxhi.keytab

    # dsxhi_spnego_keytab=/etc/security/keytabs/spnego.service.keytab

    cluster_manager_url=http://paying1.fyre.ibm.com:7180

    cluster_admin=admin

     

    existing_webhcat_url=http://paying1.fyre.ibm.com:50111/templeton/v1

     

    # existing_livyspark_url= <-- not to be used, using Livy for Spark server as part of dsx-hi

    # existing_livyspark2_url= <-- not to be used, using Livy for Spark2 server as part of dsx-hi

     

    dsxhi_livyspark_port=8998

    dsxhi_livyspark2_port=8999

     

    # known_dsx_list=https://dsxlcluster1.ibm.com,https://dsxlcluster2.ibm.com:31843

    Once the properties have been configured, the install script can be run. This will configure and start the required services.

    cd /opt/ibm/dsxhi/bin

    ./install.py

    Output:

    IBM Data Science Experience Hadoop Integration Service

    ------------------------------------------------------

    Terms and Conditions: http://www14.software.ibm.com/cgi-bin/weblap/lap.pl?la_formnum=&li_formnum=L-KLSY-AVZW2D&title=IBM+Data+Science+Experience+Hadoop+Integration&l=en

     

    --Determining properties

    --Running the prechecks

     

    --Install Livy for Spark / Spark 2

    --Configure gateway

    --Create template for gateway

    --Setting up known DSX Local clusters

    --Install dsxhi_rest

    --Start all services

    --Install finished. Check status in /var/log/dsxhi/dsxhi.log
     

    After the RPM is successfully installed, the following components are started:

     

    [Screenshot: DSXHI components (gateway, REST server, Livy for Spark/Spark2) running after installation]
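    At this point the ports checked earlier should show listeners, and the log mentioned in the installer output should be free of errors. A quick verification, reusing the values from this example configuration:

    # The gateway, REST, and Livy ports should now be in use
    netstat -ano | egrep '8443|8082|8998|8999'

    # Review the DSXHI log referenced by the installer
    tail -n 50 /var/log/dsxhi/dsxhi.log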

    Register the Cloud Pak for Data service within DSXHI

    To prevent arbitrary Cloud Pak for Data clusters from accessing the services, a set of certificates must be exchanged. DSXHI provides a command that registers a DSX/Cloud Pak for Data cluster.

    cd /opt/ibm/dsxhi/bin

    ./manage_known_dsx.py --add https://<ICP4D instance IP>:31843
     

    Output:

    Successfully added:

    DSX Local Cluster URL                          DSXHI Service URL

    https://<hostname_icp4d_instance>:31843        https://<hdp hostname>:8443/gateway/9.30.0.211

    In the above output, the DSX Local Cluster URL is the address used to access Cloud Pak for Data, and the DSXHI Service URL is the URL you will use when registering the Hadoop cluster in Cloud Pak for Data.

    Register the Hadoop cluster within Cloud Pak for Data

    Now that the Cloud Pak for Data cluster has been registered in DSXHI, you can access Hadoop from Cloud Pak for Data. In Cloud Pak for Data, select “Administer” from the left menu and then choose “Hadoop integration”.

    [Screenshot: Hadoop integration page under Administer in Cloud Pak for Data]

    Click on “Add Registration”, pick a name and enter the DSXHI Service URL in the “Service URL” field. Finally, select the user (dsxhi) you will use to log on to the Hadoop cluster and click Add.

    [Screenshot: Add Registration dialog with name, Service URL, and user fields]

    You should see the following message: “Successfully registered the system.”

    When you click the three dots under Actions, you can view the registration details and see which service endpoints are exposed.

    [Screenshot: Registration details showing the exposed service endpoints]

    In the Ambari UI, set the configuration below under Ranger KMS > Custom kms-site:

     

    hadoop.kms.proxyuser.dsxhi.hosts=*

    hadoop.kms.proxyuser.dsxhi.groups=*

     

    Ensure that spark.ssl.keyStore and spark.ssl.trustStore under Spark2 > Custom spark2-defaults point to the correct JKS path (fci_universal_ks.jks). If not, change them accordingly.
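    If you need to confirm what a given JKS file contains, keytool can list its entries. The path below is a placeholder for the location of fci_universal_ks.jks in your environment:

    # Lists the certificates and keys in the keystore (you will be prompted for the store password)
    keytool -list -keystore /path/to/fci_universal_ks.jks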

     

    This completes the setup and integration with the data science component.

    Now that the Hadoop integration has been set up, you can access HDFS and Hive data sources from within a notebook and also push down processing to the Hadoop cluster.
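    As an illustration of the push-down path, the Livy service started by DSXHI can be exercised directly on the edge node through its REST API. The Livy ports are not exposed externally, so this is only a local smoke test; the HDFS path is a placeholder, and on a Kerberized cluster you may additionally need to kinit and pass --negotiate -u : to curl:

    # Submit a PySpark batch job to Livy for Spark2 on the port configured earlier (8999)
    curl -s -X POST -H 'Content-Type: application/json' \
      -d '{"file": "hdfs:///user/dsxhi/example.py"}' \
      http://localhost:8999/batches

    # Check the state of the batch returned above (here batch id 0)
    curl -s http://localhost:8999/batches/0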

     

  2. Troubleshooting

    1. DSXHI REST server in error status

    [Screenshot: DSXHI REST server shown in error status]

    How to resolve this problem:

    Make sure the JDK on the edge node supports AES-256 cryptography. The JDK version should be JDK 8 or newer.
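    One way to check whether the JDK allows 256-bit AES keys (that is, whether the unlimited-strength policy is in effect) is the following generic JDK one-liner; it is not a DSXHI-specific command:

    # Prints true when AES keys of 256 bits or more are allowed
    jrunscript -e 'print(javax.crypto.Cipher.getMaxAllowedKeyLength("AES") >= 256)'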

    2. Registration failure Issue 1 – gethostbyname failure

    [Screenshot: registration failure with gethostbyname error]

    How to resolve this problem:

    1) Get the utils-api pod name.

    kubectl get pods -n zen | grep utils-api

     

    2) If the above command returns utils-api-86568c499b-rjlqk, then execute the following command to check the logs of the utils-api pod.

    kubectl logs utils-api-86568c499b-rjlqk -n zen

     

    3) If there is an error message saying 'add_endpoint : ERROR : gethostbyname failure connect:errno=22':

    a)      Execute the following command and log into the utils-api pod.

    kubectl exec -it utils-api-86568c499b-rjlqk -n zen bash

    b)      Ping the edge node by host name:

    ping hostname_edgenode

    If it is not pingable, modify the /etc/hosts file to add the host name of the edge node, and then try again.
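    The entry added to /etc/hosts follows the usual format of an IP address followed by one or more host names; the address and names below are placeholders:

    # Example /etc/hosts entry for the Hadoop edge node
    192.0.2.10   hadoop-edge.example.com   hadoop-edge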

     

    3. Registration failure Issue 2 – Keystore file does not exist

    [Screenshot: registration failure with "Keystore file does not exist" error]

    How to resolve this problem:

    1) Check the logs of the utils-api pod.

    kubectl logs utils-api-86568c499b-rjlqk -n zen

     

    2) If there is an error message saying 'keytool error: java.lang.Exception: Keystore file does not exist: /user-home/_global_/security/customer-truststores/cacerts'

    Execute the following command to delete the utils-api pod and then try again.

    kubectl delete pod utils-api-86568c499b-rjlqk -n zen

     

    4. Registration failure Issue 3 – Keystore file exists, but is empty

    [Screenshot: registration failure with "Keystore file exists, but is empty" error]

     

    How to resolve this problem:

    1) Check the logs of the utils-api pod.

    kubectl logs utils-api-86568c499b-rjlqk -n zen

     

    2) If there is an error message saying 'keytool error: java.lang.Exception: Keystore file exists, but is empty: /user-home/_global_/security/customer-truststores/cacerts'

    Execute the following command to delete the utils-api pod and then try again.

    kubectl delete pod utils-api-86568c499b-rjlqk -n zen

  3. Reference

    1. DSXHI service related logs for problem determination

    Service                                      Log location
    Hadoop integration software gateway          /opt/ibm/dsxhi/gateway/logs
    Hadoop integration software REST server      /var/log/dsxhi
    Livy for Spark                               /var/log/livy
    Livy for Spark2                              /var/log/livy2
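    To scan all of these locations at once when diagnosing a problem, something like the following can be used on the edge node (adjust the file patterns to what is actually present):

    tail -n 100 /var/log/dsxhi/*.log /var/log/livy/*.log /var/log/livy2/*.log /opt/ibm/dsxhi/gateway/logs/*.log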
     

    2. How to check logs when the registration fails in Cloud Pak for Data

    If the registration fails, view the kubectl logs of the utils-api pod in the Cloud Pak for Data cluster.

    1) Get the utils-api pod name.

    kubectl get pods -n zen | grep utils-api

     

    2) If the above command returns utils-api-86568c499b-rjlqk, then execute the following command.

    kubectl logs utils-api-86568c499b-rjlqk -n zen

     

    3. Knowledge Center

    Integrating with a Hadoop cluster:

    https://www.ibm.com/support/knowledgecenter/SSQNUZ_current/com.ibm.icpdata.doc/zen/install/hdp.html?view=embed#hdp__option-1
