Overview

Skill Level: Any Skill Level

This recipe provides a guide for configuring IBM InfoSphere Information Server DataStage to connect to Hive using DataDirect's JDBC driver as a source or target within a job design.

Ingredients

IBM InfoSphere Information Server 9.1, 11.3, or 11.5 (11.5 or later preferred)

Hive Server

DataDirect JDBC Driver (bundled with IBM InfoSphere Information Server)

Other third party JDBC drivers may be used, but configuration steps for the JDBC URL, etc may vary.  Please see the driver vendor's documentation for details.

Step-by-step

  1. Hive JDBC Driver

    It is important to obtain the latest DataDirect Hive JDBC driver to benefit from known fixes and performance improvements.

    To get the current version of your Hive driver:

    java -cp <location_of_IShive.jar> com.ibm.isf.jdbc.hive.HiveDriver

    IShive.jar is typically located in: /opt/IBM/InformationServer/ASBNode/lib/java so the full command would be:

    java -cp /opt/IBM/InformationServer/ASBNode/lib/java/IShive.jar com.ibm.isf.jdbc.hive.HiveDriver

    JR56156 is a more recent version of the driver and can be obtained by installing 11.5 Fix Pack 2 or contacting IBM Support.

  2. IBM JDK

    All stages: JDBC Connector, Hive Connector, and File Connector (Hive Table create option) rely on the IBM JDK for the JVM that is used to establish the JDBC connection.

    It is important to obtain the latest version of the IBM JDK to benefit from known fixes and security vulnerabilities.

    To upgrade the IBM JDK that is used with IBM InfoSphere Information Server, see one of the security bulletins

    Note that IBM Fix Central will display the latest IBM JDK version that is certified by IBM InfoSphere Information Server, which might supersede the fix mentioned in the security bulletin.

    To upgrade the IBM JDK used with Information Server, follow these steps:

    1. Extract the downloaded JDK:
      tar xvf ibm-java-sdk* --directory /opt/IBM/InformationServer
    2. Backup the original IBM JDK:
      mv /opt/IBM/InformationServer/jdk /opt/IBM/InformationServer/jdk-`date +"%Y%m%d%H%M"`
    3. Move the new JDK in place:
      mv /opt/IBM/InformationServer/ibm-java* /opt/IBM/InformationServer/jdk
    4. [Optional] If you are using Kerberos, or had previously applied the unrestricted export policy files, please reapply them now.

     

    Please consult the README packaged with the IBM JDK for additional information as well as instructions for upgrading the WebSphere JDK.

  3. isjdbc.config

    The isjdbc.config file controls the CLASSPATH and CLASSNAMES that are loaded by all JDBC related stages.

    Sample contents for the isjdbc.config:

    CLASSPATH=/opt/IBM/InformationServer/ASBNode/lib/java/IShive.jar
    CLASS_NAMES=com.ibm.isf.jdbc.hive.HiveDriver

    Additional entries can be added to both the CLASSPATH and CLASS_NAMES variables in the isjdbc.config by adding a ; delimiter.

  4. JAAS Configuration File (Kerberos)

    If your JDBC connection does not require Kerberos, you can skip this step and proceed to step #6

    The JDBCDriverLogin.conf file is a configuration file that specifies which Java Authentication and Authorization Service (JAAS) login module to use for Kerberos authentication.  By default, the file is read from: /opt/IBM/InformationServer/ASBNode/lib/java

    The JDBCDriverLogin.conf file can be loaded from a custom directory by setting the JVM option:

    -Djava.security.auth.login.config=/path/to/JDBCDriverLogin.conf

    By default, the krb5.conf (Kerberos configuration file) will be read from /etc/krb5.conf.  This location can be overwritten by setting the JVM option:

    -Djava.security.krb5.conf=/path/to/krb5.conf

    These JVM options can be set by setting the environment variable:

    CC_JVM_OPTIONS

    Or in the case of the JDBC Connector or Hive Connector, can be set directly in the stage under Java Properties.

    StageJVMOptions

    Sample JDBCDriverLogin.conf file:

    JDBC_DRIVER_keytab{
    com.ibm.security.auth.module.Krb5LoginModule required
    credsType=both
    principal="dsadm@IBM.COM"
    useKeytab="FILE:/home/dsadm/dsadm.keytab";
    };

    There are different options available for the JDBCDriverLogin.conf file, for instance instead of a Kerberos Keytab, you can use a Kerberos Credential Cache:

    JDBC_DRIVER_cache{
    com.ibm.security.auth.module.Krb5LoginModule required
    credsType=initiator
    principal="dsadm@IBM.COM"
    useCcache="FILE:/tmp/krb5cc_13251";
    };

    Where /tmp/krb5cc_13251 is the location of the Kerberos Credential Cache as seen by the output of a command such as klist:

    Ticket cache: FILE:/tmp/krb5cc_13251
    Default principal: dsadm@IBM.COM
    Valid starting Expires Service principal
    07/06/17 01:30:02 07/07/17 01:30:02 krbtgt/IBM.COM@IBM.COM
    renew until 07/06/17 01:30:02

    In either case, the principal provided should be a user principal, i.e. the user you are obtaining a Kerberos ticket with.  The Hive Service Principal can be provided in the JDBC URL and will be discussed in step #5.

    Multiple entries can be defined in the JDBCDriverLogin.conf file.  If multiple entries are defined, the JDBC URL option loginConfigName can be set to control multiple entries.  For example, to use the entry that has a keytab, the JDBC URL might look like:

    jdbc:ibm:hive://hiveserver.ibm.com:10000;AuthenticationMethod=kerberos;ServicePrincipalName=hive/hiveserver.ibm.com@IBM.COM;loginConfigName=JDBC_DRIVER_keytab

    JDBC URL syntax will be covered at length in step #6 below.

  5. Hive Service Principal (Kerberos)

    If your JDBC connection does not require Kerberos, you can skip this step and proceed to step #6

    The Hive Service Principal is typically needed for the ServicePrincipalName JDBC URL option. 

    The service principal name is typically:

    hive/[hive-server]@REALM.COM

    Where:

    hive is your service name.

    [hive-server] is the fully qualified domain name of the Hive Server.

    REALM.COM is your Kerberos realm.

    Otherwise, you should be able to find your service principal by looking at the value of the hive.server2.authentication.kerberos.principal property in the hive-site.xml file.

  6. Hive JDBC Driver URL Syntax

    Full JDBC options for the DataDirect driver, can be found within their documentation.

    The most important aspect to remember is that because the DataDirect JDBC drivers are embedded/provided by IBM InfoSphere Information Server, they are branded.  This means the beginning of the JDBC URL will be:

    jdbc:ibm:hive://

    Some basic JDBC URL examples are included below:

    Sample URL (Non-Kerberos)

    jdbc:ibm:hive://hiveserver.ibm.com:10000

    Sample URL (Kerberos)

    jdbc:ibm:hive://hiveserver.ibm.com:10000;AuthenticationMethod=kerberos;ServicePrincipalName=hive/hiverserver.ibm.com@IBM.COM

     

    Note:

    Hive allows for two methods for hive.server2.transport.mode.  Currently, DataDirect only supports binary mode (TCPIP) which is the default for Hive. HTTP/S support will be added a later date.

     

    This concludes the JDBC configuration/setup, now we are ready to proceed to configuring/using either the Hive Connector, File Connector Hive Table Create Option, or the JDBC Connector stages as will be detailed in the remaining steps.

  7. Hive Connector

    The Hive Connector is the preferred method for connecting to Hive via JDBC.  The initial release of the Hive Connector was on 11.5

    One of the most powerful features of the Hive Connector is it’s ability to perform partitioned reads and partitioned writes which allows you to extract/load data in parallel.

    The Hive Connector also includes support for creating tables with different Row formats and data storage formats:

    HiveConnectorRowFormatHiveConnectorDataFormat

    The default stage properties for a source Hive Connector stage are shown below:

    HiveConnectorStageProperties

  8. File Connector (Hive Table create option)

    The File Connector has an option to create an External Hive table on top of HDFS data that it loads.  This is one of the fastest ways to load into Hive.  The Hive Table create option leverages a JDBC Connection, some sample properties from the File Connector stage are shown below.  All the previous steps in this recipe are relevant to establish connectivity for the Hive Table Create option within the File Connector.

    The File Connector Hive Table create option does not have a “visible” JDBC URL.¬† Although additional JDBC options can be passed to the constructed URL by setting the “Additional driver attributes” option within the stage properties.

    HiveTableCreateOptionJDBCOptions

  9. JDBC Connector

    The JDBC Connector can be used with the same URL we constructed in step #6.¬† The main difference with this stage, is that in some cases, queries may need to be “user-defined”.

    Some default stage properties for a source JDBC Connector stage are shown below:

    JDBCConnectorStageProperties

  10. Debugging

    When debugging Kerberos configuration issues, setting the following JVM options help obtain a Kerberos trace:

    -Dcom.ibm.security.jgss.debug=all -Dcom.ibm.security.krb5.Krb5Debug=all

    The easiest way to set these for the JVM is to use the CC_JVM_OPTIONS environment variable.

    When debugging generic Java issues or SSL issues, the following JVM option can be helpful:

    -Djavax.net.debug=all

    All three stages (Hive Connector, JDBC Connector, and File Connector) support the CC_MSG_LEVEL environment variable which will provide additional information into the job log.

3 comments on"JDBC Hive Connectivity with IBM InfoSphere Information Server"

  1. DM_gm_usa July 19, 2017

    Thanks Scott for the nice Tutorial.

  2. Just found this…nicely done Scott…thank you.

  3. Suneel Kalapala May 03, 2018

    Nice article ..Thanks Scott..

Join The Discussion