How to install Tez on IOP 4.2 - Hadoop Dev

Technical Blog Post


The following directions detail the manual installation of software into IBM Open Platform for Apache Hadoop. These directions, and any binaries that may be provided as part of this article (either hosted by IBM or otherwise), are provided for convenience and make no guarantees as to stability, performance, or functionality of the software being installed. Product support for this software will not be provided (including upgrade support for either IOP or the software described). Questions or issues encountered should be discussed on the BigInsights StackOverflow forum or the appropriate Apache Software Foundation mailing list for the component(s) covered by this article.

This blog post introduces you to some basic steps for installing and configuring Apache Tez on top of the IBM Open Platform with Apache Hadoop (IOP) 4.2. Examples show you how to run Tez, Hive, or Pig jobs that are using Tez as the execution engine.

Apache Tez is a data processing framework on top of Apache Hadoop YARN that was developed for building high performance batch and interactive data processing applications. Tez is generally faster than MapReduce and can still scale into the petabyte range. When Tez is used as the execution engine instead of MapReduce, Hive and Pig performance also improves.

Before installing Tez, be sure that you have the following prerequisites:

  • Maven 3 or higher
  • Protocol Buffer 2.5.0, including the protocol buffer compiler

Note: For rpm-based Linux systems, the yum repository might not have Protocol Buffer 2.5.0, but you can download it from https://github.com/google/protobuf/releases/tag/v2.5.0.
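
The prerequisites can be checked from the shell before building. The following is only a sketch: the helper names version_ge and check_prereqs are hypothetical, the comparison relies on GNU sort -V, and the parsing assumes that mvn -version prints a line starting with "Apache Maven" and that protoc --version prints "libprotoc" followed by the version.

```shell
# version_ge A B: succeeds when dotted version A is >= version B.
# Relies on 'sort -V' (version sort) from GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# check_prereqs: hypothetical helper that reports on Maven and protoc.
# The output formats parsed here are assumptions; adjust if yours differ.
check_prereqs() {
  local mvn_ver protoc_ver
  mvn_ver=$(mvn -version 2>/dev/null | awk '/^Apache Maven/ {print $3; exit}')
  protoc_ver=$(protoc --version 2>/dev/null | awk '{print $2; exit}')

  if [ -n "$mvn_ver" ] && version_ge "$mvn_ver" 3; then
    echo "Maven $mvn_ver: OK"
  else
    echo "Maven 3 or higher not found"
  fi

  if [ "$protoc_ver" = "2.5.0" ]; then
    echo "protoc $protoc_ver: OK"
  else
    echo "protoc 2.5.0 not found (got: ${protoc_ver:-none})"
  fi
}
```

Run check_prereqs on the build machine after defining both functions; it only prints a report and changes nothing.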

Get the Tez source code

There are two ways of getting the Tez source code.

  • To get the source code from the Apache Git server, complete the following steps.
    1. Use the git command to clone the source code:
      git clone git://git.apache.org/tez.git  
    2. Change to the cloned tez directory and check out the 0.8.2 release:
      cd tez
      git checkout -b 0.8.2 rel/release-0.8.2
  • To get the source code from the Apache site, complete the following steps.
    1. Get the Tez source package from http://www.eu.apache.org/dist/tez/0.8.2/apache-tez-0.8.2-src.tar.gz.
    2. Unpack the tar ball by running the following command:
      tar -vxzf apache-tez-0.8.2-src.tar.gz  
    3. Navigate to the source code folder:
      cd apache-tez-0.8.2-src  

When the source code is ready, use the following Maven command to package the tar balls:

mvn clean package -Dhadoop.version=2.7.2 -Dpig.version=0.15.0 -DskipTests  

You can then find the tar balls in tez-dist/target: tez-0.8.2.tar.gz and tez-0.8.2-minimal.tar.gz.

Install and configure Tez

  1. As user hdfs, place the Tez tar ball on the HDFS.
    su hdfs
    hadoop fs -mkdir /path/to/tez
    hadoop fs -put tez-0.8.2.tar.gz /path/to/tez
  2. To configure Tez, complete the following steps, which are required on all Tez client nodes (from which Tez applications are submitted to the YARN Resource Manager).
    1. As root, create the configuration folders.
      su
      mkdir -p /etc/tez/conf/
      mkdir -p /usr/iop/4.2.0.0/tez
    2. Create a tez-site.xml file under /etc/tez/conf/, and add the following lines to the file:
      <configuration>
        <property>
          <name>tez.lib.uris</name>
          <value>${fs.defaultFS}/path/to/tez/tez-0.8.2.tar.gz</value>
        </property>
      </configuration>

      Ensure that the property “tez.use.cluster.hadoop-libs” is not set in tez-site.xml, or that the value is set to false. The tez.lib.uris property holds a comma-delimited list of the locations of the Tez libraries. Specifying tez-0.8.2.tar.gz assumes that a compressed version of the Tez libraries is being used.

    3. Place the JAR files into a specific local folder:
      tar -vxzf tez-0.8.2-minimal.tar.gz -C /usr/iop/4.2.0.0/tez  
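
Because these configuration steps must be repeated on every Tez client node, it can help to script the creation of tez-site.xml. The following is a minimal sketch, not part of the original procedure: write_tez_site is a hypothetical helper, and its second argument is whatever HDFS folder you used in place of /path/to/tez. Note the escaped \$ so that ${fs.defaultFS} reaches the file literally and is resolved by Hadoop rather than by the shell.

```shell
# write_tez_site CONF_DIR TEZ_HDFS_DIR  (hypothetical helper)
# Writes CONF_DIR/tez-site.xml pointing tez.lib.uris at the Tez tar ball
# that was uploaded to TEZ_HDFS_DIR on HDFS.
write_tez_site() {
  local conf_dir=$1 tez_hdfs_dir=$2
  mkdir -p "$conf_dir"
  # \${fs.defaultFS} stays literal; ${tez_hdfs_dir} is expanded by the shell.
  cat > "$conf_dir/tez-site.xml" <<EOF
<configuration>
  <property>
    <name>tez.lib.uris</name>
    <value>\${fs.defaultFS}${tez_hdfs_dir}/tez-0.8.2.tar.gz</value>
  </property>
</configuration>
EOF
}
```

For example, as root on a client node: write_tez_site /etc/tez/conf /path/to/tez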

    Deploy Tez

    To deploy Tez, complete the following steps.

    1. Change Hadoop environment variables.
      1. Go to the Ambari administration page and click HDFS. On the Configs tab, under Advanced, find the hadoop-env.sh template and append the following lines:
        export TEZ_CONF_DIR=/etc/tez/conf/
        export TEZ_JARS=/usr/iop/4.2.0.0/tez
        export HADOOP_CLASSPATH=${TEZ_CONF_DIR}:${TEZ_JARS}/*:${TEZ_JARS}/lib/*:${HADOOP_CLASSPATH}:${JAVA_JDBC_LIBS}:${MAPREDUCE_LIBS}

        Do not omit “*”, and use “:” instead of “;” as the separator character.
        Hadoop uses this location on client nodes to find the tez-site.xml file and, consequently, the Tez libraries.
        Be sure to change the /usr/iop/4.2.0.0/hadoop/conf/hadoop-env.sh file on all nodes if you are deploying Tez without the Ambari server.

      2. The Restart service tab might appear on the page after you click Save; if so, click Restart all affected to restart the HDFS service.
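
After the restart, you may want to confirm on a client node that the Tez entries actually reached the Hadoop classpath. The following sketch assumes that the hadoop classpath command prints the entries exactly as they were exported (trailing slash and unexpanded “*” included); classpath_has is a hypothetical helper.

```shell
# classpath_has ENTRY CLASSPATH  (hypothetical helper)
# Succeeds if ENTRY is exactly one of the colon-separated elements of
# CLASSPATH. The quoted pattern keeps characters such as '*' literal.
classpath_has() {
  case ":$2:" in
    *":$1:"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# report_tez_classpath CLASSPATH: prints which Tez entries are present.
report_tez_classpath() {
  local entry
  for entry in '/etc/tez/conf/' '/usr/iop/4.2.0.0/tez/*' '/usr/iop/4.2.0.0/tez/lib/*'; do
    if classpath_has "$entry" "$1"; then
      echo "found:   $entry"
    else
      echo "missing: $entry"
    fi
  done
}
```

A possible invocation is report_tez_classpath "$(hadoop classpath)"; any line that starts with "missing:" suggests that hadoop-env.sh was not updated on that node.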

    Configure the MapReduce service (optional)

    Complete the following steps only if you are planning to run all MapReduce applications on the Tez framework. Otherwise, skip to the next section.

    1. Click MapReduce2 on the Ambari administration page. On the Configs tab, under Advanced, find the field “mapreduce.framework.name” under Advanced mapred-site and change the value from “yarn” to “yarn-tez”. Be sure to change mapreduce.framework.name in /usr/iop/4.2.0.0/hadoop/conf/mapred-site.xml on all nodes if you are deploying Tez without the Ambari server.
    2. The Restart service tab might appear on the page after you click Save; if so, click Restart all affected to restart the MapReduce2 service.

    Run a quick test

    1. Log in as user hdfs on a Tez client node.
    2. Run the following command:
      hadoop jar /usr/iop/4.2.0.0/tez/tez-examples-0.8.2.jar wordcount \
        /hdfs/path/of/input/data /hdfs/path/of/output/folder

      Be sure to specify a nonexistent folder on the HDFS as the output folder.

    3. Go to the YARN web console and verify that the application completed successfully, as shown in the following example:

      [Image: Tez Success Sample]
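
Because the job fails when the output folder already exists, the quick test can be wrapped in a small guard. This is a sketch only: run_tez_wordcount is a hypothetical wrapper, and it relies on hadoop fs -test -e returning 0 when the given path exists.

```shell
# run_tez_wordcount INPUT OUTPUT  (hypothetical wrapper)
# Refuses to start the example job if OUTPUT already exists on HDFS.
run_tez_wordcount() {
  local input=$1 output=$2
  # 'hadoop fs -test -e PATH' exits with 0 when PATH exists.
  if hadoop fs -test -e "$output"; then
    echo "Output folder $output already exists; choose another path." >&2
    return 1
  fi
  hadoop jar /usr/iop/4.2.0.0/tez/tez-examples-0.8.2.jar wordcount \
    "$input" "$output"
}
```

For example: run_tez_wordcount /hdfs/path/of/input/data /hdfs/path/of/output/folder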

    Run Hive using Tez as the execution engine in the Hive console

    1. Switch to user hive:
      su - hive  
    2. Start the Hive command line:
      /usr/iop/4.2.0.0/hive/bin/hive  
    3. Set the execution engine:
      set hive.execution.engine=tez;  
    4. Run a Hive job:
      select key, count(*) from
        (select x.key as key, y.value as value from srcpart x
           join srcpart y on (x.key = y.key)
         union all select key, value from srcpart z) a
        join src b on (a.value = b.value)
      group by a.key, a.value;

      The result is printed to the console, as shown in the following example:

        --------------------------------------------------------------------------------
                VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
        --------------------------------------------------------------------------------
        Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
        Map 4 ..........   SUCCEEDED      1          1        0        0       0       0
        Map 5 ..........   SUCCEEDED      1          1        0        0       0       0
        Map 6 ..........   SUCCEEDED      1          1        0        0       0       0
        Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
        --------------------------------------------------------------------------------
        VERTICES: 05/05  [==========================>>] 100%  ELAPSED TIME: 6.76 s
        --------------------------------------------------------------------------------
    5. Go to the YARN web console and verify that the application completed successfully, as shown in the following example:

      [Image: successful Hive on Tez job]

    Run Pig using Tez as the execution engine

    Complete the following steps on each node that has both Pig and Tez clients installed:

    1. Navigate to the h2 folder.
      cd /usr/iop/4.2.0.0/pig/lib/h2
    2. Remove all Tez-related JAR files.
      rm -rf tez*.jar
    3. Create symbolic links that point to the Tez JAR files that you extracted to /usr/iop/4.2.0.0/tez when you configured Tez.
      ln -s /usr/iop/4.2.0.0/tez/tez-api-0.8.2.jar tez-api-0.8.2.jar
      ln -s /usr/iop/4.2.0.0/tez/tez-common-0.8.2.jar tez-common-0.8.2.jar
      ln -s /usr/iop/4.2.0.0/tez/tez-dag-0.8.2.jar tez-dag-0.8.2.jar
      ln -s /usr/iop/4.2.0.0/tez/tez-mapreduce-0.8.2.jar tez-mapreduce-0.8.2.jar
      ln -s /usr/iop/4.2.0.0/tez/tez-runtime-internals-0.8.2.jar tez-runtime-internals-0.8.2.jar
      ln -s /usr/iop/4.2.0.0/tez/tez-runtime-library-0.8.2.jar tez-runtime-library-0.8.2.jar
      ln -s /usr/iop/4.2.0.0/tez/tez-yarn-timeline-history-with-acls-0.8.2.jar tez-yarn-timeline-history-with-acls-0.8.2.jar
    4. To run a Pig script in Tez mode, use pig -x tez .... The following message is printed to the console:
      INFO pig.ExecTypeProvider: Picked TEZ as the ExecType  
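
The seven symbolic links from step 3 can also be created in a loop, which is less error-prone when the steps are repeated on several nodes. This is a sketch under the same paths used above; link_tez_jars is a hypothetical helper, and -f is added so that rerunning it replaces existing links instead of failing.

```shell
# link_tez_jars TEZ_DIR PIG_H2_DIR  (hypothetical helper)
# Links the same seven Tez 0.8.2 JAR files listed in step 3 from TEZ_DIR
# into the Pig h2 library folder.
link_tez_jars() {
  local tez_dir=$1 h2_dir=$2 name
  for name in api common dag mapreduce runtime-internals runtime-library \
              yarn-timeline-history-with-acls; do
    # -s: symbolic link, -f: replace a link left over from a previous run
    ln -sf "$tez_dir/tez-$name-0.8.2.jar" "$h2_dir/tez-$name-0.8.2.jar"
  done
}
```

For example, after removing the old JAR files in step 2: link_tez_jars /usr/iop/4.2.0.0/tez /usr/iop/4.2.0.0/pig/lib/h2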
