Ambari Pig View in Kerberos Enabled Clusters - Hadoop Dev

Overview
Ambari Views provide a user interface to the various functionalities of the services installed on a Hadoop cluster. Apache Pig is used to process and analyze large data sets via Pig scripts, and Ambari’s Pig view allows users to create, edit and submit these scripts. To provide secure authentication of requests for services, Kerberos may be enabled on a Hadoop cluster. With Kerberos enabled, a few additional steps are required to configure the Ambari Pig view. This blog post provides the steps involved in successfully configuring the Pig view in the IOP 4.2 release.

Steps:

  1. If you have Kerberized your cluster through Ambari, then to successfully use Pig views, you also need to Kerberize ambari-server as follows:

    1. On the ambari-server host, go into the following directory:
      cd /etc/security/keytabs

    2. Log into the kadmin tool as follows:
      kadmin -p <admin principal name from your KDC>
      and use the corresponding admin principal password.

      img1

    3. Generate an ambari-server principal which will be used for the ambari-server keytab as follows:
      addprinc -randkey <ambari-server-principal-name>@<REALM>
      where,
      <ambari-server-principal-name> is the principal name of your choice
      <REALM> is the realm name of your KDC server
      e.g. ambari-server-principal@ABC.COM
      This principal name is also used as the proxyuser for WebHDFS Authentication in step 5.2.

      img2

    4. Generate ambari-server keytab as follows:
      xst -k <ambari-server-keytab-name>.keytab <ambari-server-principal-name>@<REALM>
      where,
      <ambari-server-keytab-name> is the keytab name of your choice, e.g. ambari.server.keytab
      <REALM> is the realm name of your KDC server

      img3_copy

    5. Exit from the kadmin tool by typing q (for quit). You should now see the keytab file at /etc/security/keytabs/ambari.server.keytab

      img4_copy

    6. Stop ambari-server as follows:
      ambari-server stop

    7. Setup security on ambari-server as follows:
      ambari-server setup-security

      • select option 3
      • enter the Kerberos principal name used for ambari-server keytab in step 1.3
        e.g. if the principal used in step 1.3 is ambari-server-principal@<REALM>, then you have to enter ambari-server-principal.
      • enter the absolute path of the keytab generated in step 1.4
        e.g. /etc/security/keytabs/ambari.server.keytab
    8. Start ambari-server as follows:
      ambari-server start

      img5_copy
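
    As a concrete illustration of steps 1.2 through 1.7, a complete session might look like the following. The admin principal admin/admin@ABC.COM, the realm ABC.COM, the principal name ambari-server-principal and the keytab name ambari.server.keytab are all hypothetical examples; substitute the values for your own KDC and cluster:

      cd /etc/security/keytabs
      kadmin -p admin/admin@ABC.COM
      kadmin:  addprinc -randkey ambari-server-principal@ABC.COM
      kadmin:  xst -k ambari.server.keytab ambari-server-principal@ABC.COM
      kadmin:  q
      ambari-server stop
      ambari-server setup-security
      # choose option 3, enter ambari-server-principal as the principal name,
      # and /etc/security/keytabs/ambari.server.keytab as the keytab path
      ambari-server start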

  2. Log into Ambari UI as an admin user.
  3. Go to Manage Ambari following the path admin->Manage Ambari as shown below:

    img6_copy

  4. Create a Pig view instance following the path Views->PIG->Create instance as shown below:

    img7

  5. Enter the view configuration details for the Create instance step as follows:

    1. Under Details, enter:
      • Instance Name e.g. “pig_view”
      • Display name e.g. “Pig View”
      • Description for the view e.g. “This Pig view is used to process user’s data set”
      • img8

    2. Under Settings:
      • If Kerberos is not enabled for the cluster, then no change is required here.
        img9

      • If Kerberos is enabled, then:
        • set WebHDFS Authentication as:
          auth=KERBEROS;proxyuser=<ambari-server-principal-name>
          where, <ambari-server-principal-name> is the Principal Name used when the ambari-server is Kerberized (refer to step-1).
        • set WebHCat Username as:
          ${username}

          img10

    3. Under Cluster Configuration:
      • If Kerberos is not enabled for the cluster, then select Local Ambari Managed Cluster.
        img11
      • If Kerberos is enabled, then select Custom and:
        • set WebHDFS File System URI as:
          webhdfs://<value-of-http-address>
          where, <value-of-http-address> is the value of the property Services->HDFS->Configs->Advanced->Advanced hdfs-site->dfs.namenode.http-address
        • set WebHCat Hostname as:
          Enter the value of the property Services->Hive->Configs->Advanced->WebHCat Server->WebHCat Server Host
        • set WebHCat Port as:
          Enter the value of the property Services->Hive->Configs->Advanced->Advanced webhcat-site->templeton.port
        • If your cluster has NameNode High Availability, then set the following fields as follows:
          • set Logical name of the NameNode cluster with value from:
            Services->HDFS->Custom hdfs-site->dfs.nameservices
          • set List of NameNodes with value from:
            Services->HDFS->Custom hdfs-site->dfs.ha.namenodes.<logical name of NameNode>
          • set First NameNode RPC Address with value from:
            Services->HDFS->Custom hdfs-site->dfs.namenode.rpc-address.<logical name of NameNode>.nn1
          • set Second NameNode RPC Address with value from:
            Services->HDFS->Custom hdfs-site->dfs.namenode.rpc-address.<logical name of NameNode>.nn2
          • set First NameNode HTTP (WebHDFS) Address with value from:
            Services->HDFS->Custom hdfs-site->dfs.namenode.http-address.<logical name of NameNode>.nn1
          • set Second NameNode HTTP (WebHDFS) Address with value from:
            Services->HDFS->Custom hdfs-site->dfs.namenode.http-address.<logical name of NameNode>.nn2
          • set Failover Proxy Provider with value from:
            Services->HDFS->Custom hdfs-site->dfs.client.failover.proxy.provider.<logical name of NameNode>
          • img12

    4. Save the view instance.
    5. To give users or groups the permission to use this Pig view instance, under the Permissions section, add the desired users to Grant Permission to these users and add the desired groups to Grant Permission to these groups. Save the users/groups by clicking the tick mark.

      img13
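
    To summarize step 5 for a Kerberized cluster without NameNode High Availability, the resulting view configuration typically looks like the following. The host names are placeholders, and 50070 and 50111 are only the common defaults for dfs.namenode.http-address and templeton.port; always copy the actual values from your cluster configuration as described in step 5.3:

      WebHDFS Authentication    : auth=KERBEROS;proxyuser=ambari-server-principal
      WebHCat Username          : ${username}
      WebHDFS File System URI   : webhdfs://namenode.example.com:50070
      WebHCat Hostname          : webhcat.example.com
      WebHCat Port              : 50111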

  6. Create Proxy Users for HDFS:
    Ambari-Server communicates with HDFS to save and access data in the Hadoop distributed file system. The Pig view also communicates with HDFS: when a Pig script is created, it is saved onto the file system, and when a Pig script is run, standard output and error files are created in HDFS to log the details. All Pig jobs are launched through WebHCat, the RESTful API of HCatalog. HCatalog, part of Apache Hive, provides access to the data for components like Pig, Sqoop and MapReduce. The communication with HDFS for Pig therefore occurs through the user running the WebHCat server.

    The HDFS file system must recognize the user communicating with it before allowing that user access to the file system. For this reason, the users running the ambari-server daemon service and the WebHCat server must be allowed to act as proxy users in HDFS. This is done by adding proxyuser properties to the HDFS core-site.xml as follows:

    1. Go to Services->HDFS->Configs->Advanced->Custom core-site->Add Property
    2. Add proxyuser for hosts and groups running WebHCat server as follows:
      hadoop.proxyuser.<WebHCat user>.groups=*
      hadoop.proxyuser.<WebHCat user>.hosts=*
      where, <WebHCat user> is the value of the property Services->Hive->Configs->Advanced->Advanced hive-env->WebHCat User

      img14_copy

    3. Add proxyuser for hosts and groups running ambari-server daemon service.
      1. If the user running ambari-server is root, then add as follows:
        hadoop.proxyuser.root.groups=*
        hadoop.proxyuser.root.hosts=*

        img15

        img16

      2. If the user running ambari-server is non-root, e.g. ambari-server-user, then add as follows:
        hadoop.proxyuser.ambari-server-user.groups=*
        hadoop.proxyuser.ambari-server-user.hosts=*
    4. If Kerberos is enabled, then to make use of Pig view, the ambari-server must also be Kerberized separately as explained in step-1. Since the ambari-server makes use of the ambari-server keytab for all its communications, the keytab’s principal must also be added to the list of proxyusers in HDFS as follows:
      hadoop.proxyuser.<ambari-server-principal-name>.groups=*
      hadoop.proxyuser.<ambari-server-principal-name>.hosts=*
      where, <ambari-server-principal-name> is the Principal Name used while generating the ambari-server keytab in step-1.
      e.g. If the Principal Name used for the ambari-server keytab is testPrincipalName@<REALM>, then the <ambari-server-principal-name> will be testPrincipalName.

      img17

    5. Save the configs.
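
    For example, if the WebHCat User property is hcat (a common default), the ambari-server daemon runs as root, and the ambari-server keytab principal is ambari-server-principal@<REALM>, then the custom core-site properties added in this step would be the following. These names are illustrative only; use the values from your own cluster:

      hadoop.proxyuser.hcat.groups=*
      hadoop.proxyuser.hcat.hosts=*
      hadoop.proxyuser.root.groups=*
      hadoop.proxyuser.root.hosts=*
      hadoop.proxyuser.ambari-server-principal.groups=*
      hadoop.proxyuser.ambari-server-principal.hosts=*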
  7. Create Proxy Users for HIVE:
    Ambari-server communicates with the WebHCat server, and to allow the ambari-server user to act as a proxy, the WebHCat server must have proxyuser properties set as follows:

    1. Go to Services->Hive->Configs->Advanced->Custom webhcat-site->Add Property
    2. Add proxyuser for hosts and groups running ambari-server daemon service.
      1. If the user running ambari-server is root, then add as follows:
        webhcat.proxyuser.root.groups=*
        webhcat.proxyuser.root.hosts=*

        img18

      2. If the user running ambari-server is non-root, e.g. ambari-server-user, then add as follows:
        webhcat.proxyuser.ambari-server-user.groups=*
        webhcat.proxyuser.ambari-server-user.hosts=*
    3. If Kerberos is enabled, add the ambari-server keytab’s Principal Name as proxyuser:
      webhcat.proxyuser.<ambari-server-principal-name>.groups=*
      webhcat.proxyuser.<ambari-server-principal-name>.hosts=*
      where, <ambari-server-principal-name> is the Principal Name used while generating the ambari-server keytab in step-1.
      e.g. If the Principal Name used for the ambari-server keytab is testPrincipalName@<REALM>, then the <ambari-server-principal-name> will be testPrincipalName.

      img19

    4. Save the configs.
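
    Continuing the same example (ambari-server running as root and the keytab principal being ambari-server-principal@<REALM>), the custom webhcat-site properties added in this step would be:

      webhcat.proxyuser.root.groups=*
      webhcat.proxyuser.root.hosts=*
      webhcat.proxyuser.ambari-server-principal.groups=*
      webhcat.proxyuser.ambari-server-principal.hosts=*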
  8. Hadoop’s HTTP web-consoles by default allow access without authentication. If Kerberos is enabled, then the HTTP web-consoles must be configured to require Kerberos authentication using the SPNEGO protocol as follows:
    1. Create a secret key used to sign authentication tokens as follows. Make sure to place this file on every host of the cluster:
      • dd if=/dev/urandom of=/etc/security/http_secret bs=1024 count=1
      • chown hdfs:hadoop /etc/security/http_secret
      • chmod 440 /etc/security/http_secret
    2. Modify the following property in:
      Services->HDFS->Configs->Advanced->Advanced core-site

      • hadoop.http.authentication.simple.anonymous.allowed = false
    3. Add/Modify the following properties in:
      Services->HDFS->Configs->Advanced->Custom core-site

      • hadoop.http.authentication.signature.secret.file = /etc/security/http_secret
      • hadoop.http.authentication.type = kerberos
      • hadoop.http.authentication.kerberos.keytab = /etc/security/keytabs/spnego.service.keytab
      • hadoop.http.authentication.kerberos.principal = HTTP/_HOST@<REALM>
        where, <REALM> is the realm name of your KDC server
      • hadoop.http.filter.initializers = org.apache.hadoop.security.AuthenticationFilterInitializer
      • hadoop.http.authentication.cookie.domain = xyz.com
        This property depends on Fully Qualified Domain Names of the servers in your cluster.
        e.g. If the host name is hostname1.xyz.com, then the property must be set to xyz.com
    4. Save the configuration.
  9. Restart all the affected services.
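
    Once the affected services are back up, you can optionally verify that SPNEGO authentication is now enforced on the HTTP web-consoles. A minimal check, assuming a curl build with Kerberos/SPNEGO support, the example realm ABC.COM and a NameNode HTTP address of namenode.example.com:50070 (use your own principal and addresses): a plain request should be rejected with HTTP 401, while the same request sent with --negotiate after kinit should succeed.

      kinit <your-principal>@ABC.COM
      curl --negotiate -u : "http://namenode.example.com:50070/webhdfs/v1/tmp?op=LISTSTATUS"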
  10. The Pig view stores user metadata in HDFS. By default, the location used is /user/<user-name-of-logged-in-ambari-user>. Since many users leverage the admin account for getting started with Ambari, the /user/admin folder needs to be created in HDFS as follows:
    su - hdfs
    hadoop fs -mkdir /user/admin
    hadoop fs -chown admin:hadoop /user/admin

    img20
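
    Note that on a Kerberized cluster the hdfs user needs a valid Kerberos ticket before it can run the commands above. A minimal sketch, assuming the commonly used headless keytab path /etc/security/keytabs/hdfs.headless.keytab (check the keytab and principal names on your own NameNode host):

    su - hdfs
    kinit -kt /etc/security/keytabs/hdfs.headless.keytab <hdfs-principal-name>@<REALM>
    hadoop fs -mkdir /user/admin
    hadoop fs -chown admin:hadoop /user/admin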

  11. If Kerberos is enabled, then care must be taken to ensure that the users allowed to use the Pig view are normal users on the ambari-server host, i.e. users with a unique UID (User Identifier) and GID (Group Identifier) associated with them. Kerberos authentication requires Pig view users to have group information before submitting a job, so having these users as normal users on the hosts is essential.
    You can create normal users as follows:

    1. Check if the user is present or not on the host as follows:
      id <username>
      where, <username> is the Ambari user allowed to use the Pig view
    2. Create the user as follows with primary group hadoop and secondary group of proxy-user:
      useradd -g hadoop -G <proxy-user-group> <username>
      where, the group <proxy-user-group> is the value of Services->HDFS->Configs->Advanced->Advanced hadoop-env->Proxy User Group. By default, the value of this property is users.
    3. Verify that the user has been created by repeating step 11.1
    4. img21
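
    For example, for a hypothetical Ambari user named alice, with the Proxy User Group left at its default value of users, steps 11.1 and 11.2 would look like this:

      id alice
      # returns "no such user" if alice does not exist yet
      useradd -g hadoop -G users alice
      id alice
      # should now show alice with primary group hadoop and secondary group users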

  12. Create and Execute a simple Pig script
    1. Open the Pig View instance by navigating to the Views dashboard as shown below:

      img22

    2. Three pre-checks (Storage test, HDFS test and WebHCat test) are executed before the Pig View instance opens. All three tests must pass, as shown below, before the instance is allowed to open.

      img23

      If the WebHCat test fails as shown below, make sure to kinit with your SPNEGO keytab as shown in the second screenshot:

      img24

      img25

    3. Once the instance opens successfully, create a new Pig script by clicking New Script as shown below:

      img26

    4. Give the new script an appropriate name, then click the Create button:

      img27

    5. Upon successful creation of the Pig Script in HDFS, you will see the following screen:

      img28

    6. Edit the script and then click Save. The following is a sample Pig script used for testing:
      A = load '/tmp/passwd' using PigStorage(':');

      img29

    7. Execute the script by clicking the Execute button. Upon successful job submission the screen appears as follows:

      img30

    8. Upon successful completion of the Pig job, the following screen appears. In case of any failures, the Ambari-Server logs will have further details:

      img31
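
    Note that the one-line script in step 12.6 only defines a relation; without a dump or store statement Pig has nothing to execute. A slightly fuller test script, still assuming the sample input /tmp/passwd shown above is present in HDFS, could look like this:

      A = load '/tmp/passwd' using PigStorage(':');
      B = foreach A generate $0 as username;
      dump B;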

[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSCRJT","label":"IBM Db2 Big SQL"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm16260111