There are two parts to setting up and running Streams on Yarn using PAM authentication: installing Yarn, and then installing Streams and configuring it to use Yarn. This article describes all the steps. The setup instructions are the same for Streams 4.0.1 and Streams 4.1.

Installing and configuring Yarn

  1. Install Yarn
    Download the latest version of Hadoop from the following site and extract the compressed tar file to a folder such as ~/YARN. Testing was done with hadoop-2.7.0; newer releases may work as well. Extracting the downloaded file creates a folder called hadoop-2.7.0 under ~/YARN. I did this setup as the yarn01 user.
    http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Download
  2. Set environment variables
    Add the following environment variables to ~/.bashrc. Customize these environment variables based on your settings.

    
    export JAVA_HOME=/opt/ibm/java-x86_64-71
    export HADOOP_HOME=<absolute path to hadoop-2.7.0>
    export HADOOP_PREFIX=$HADOOP_HOME
    export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
    export YARN_HOME=$HADOOP_HOME
    export HADOOP_YARN_HOME=$HADOOP_PREFIX
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
    export PATH=$YARN_HOME/bin:$PATH
    export PATH=$JAVA_HOME/bin:$PATH
    

    Run the following command to set the environment variables: source ~/.bashrc

  3. Edit yarn-site.xml
    You need to edit $YARN_HOME/etc/hadoop/yarn-site.xml to change the following properties:

    
    <configuration>
       <!-- Site specific YARN configuration properties -->
       <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
       </property>
       <property>
          <name>yarn.nodemanager.vmem-check-enabled</name>
          <value>false</value>
       </property>
       <property>
          <name>yarn.application.classpath</name>
          <value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*</value>
       </property>
    </configuration>
    
  4. Edit hdfs-site.xml
    You need to add the HDFS directories that Yarn will use. Update the values based on your settings.

    
    <configuration>
       <property>
          <name>dfs.replication</name>
          <value>1</value>
       </property>
       <property>
          <name>dfs.namenode.name.dir</name>
          <value>file:/yarn/hdfs/hadoop-2.7.0/namenode</value>
       </property>
       <property>
          <name>dfs.datanode.data.dir</name>
          <value>file:/yarn/hdfs/hadoop-2.7.0/datanode</value>
       </property>
    </configuration>
    
  5. Set up SSH keys
    SSH keys must be set up between all the hosts that are part of the YARN cluster so that the user can log in to every host without being prompted for a password or passphrase.
  6. Format the file system
    Format the file system before starting NameNode and DataNode daemons using the following command:

    
    $YARN_HOME/bin/hdfs namenode -format
    

    Note: The YARN logs are available at $YARN_HOME/logs.

  7. Start yarn
    The following commands will start Yarn:

    
    $YARN_HOME/sbin/hadoop-daemon.sh start namenode
    $YARN_HOME/sbin/hadoop-daemons.sh start datanode
    $YARN_HOME/sbin/yarn-daemon.sh start resourcemanager
    $YARN_HOME/sbin/yarn-daemons.sh start nodemanager
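
Before issuing the start commands above, it can be worth confirming that the variables from step 2 are actually set in the current shell and that the passwordless SSH from step 5 works. The following is only a minimal sketch and not part of the original procedure: the helper names `missing_vars` and `check_ssh` and the host names in the usage comment are hypothetical.

```shell
# Sketch of a pre-flight check for steps 2 and 5 (helper names are hypothetical).

# Print the name of every listed variable that is unset or empty.
missing_vars() {
    for name in "$@"; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        [ -n "$value" ] || echo "$name"
    done
}

# Return 0 if we can log in to the given host without being prompted.
# BatchMode=yes makes ssh fail instead of asking for a password or passphrase.
check_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Example usage (node1 and node2 are placeholder host names):
#   missing_vars JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR YARN_HOME
#   for h in node1 node2; do check_ssh "$h" || echo "no passwordless SSH to $h"; done
```

If `missing_vars` prints nothing and every `check_ssh` call succeeds, the prerequisites from those two steps are most likely in place.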
    

Note: The YARN cluster can be monitored at the following URL: http://<hostname>:8088/cluster/
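
The same port also serves the ResourceManager REST API, so this check can be scripted. The sketch below assumes that `curl` is installed and that the API is reachable under the default `/ws/v1/cluster` path. The helper names `rm_state` and `active_nodes` are made up for this example, and the crude `sed` parsing stands in for a real JSON parser.

```shell
# Sketch: pull two values out of ResourceManager REST responses (helper names are hypothetical).

# Read the JSON from /ws/v1/cluster/info on stdin and print the "state" field.
rm_state() {
    sed -n 's/.*"state"[[:space:]]*:[[:space:]]*"\([A-Z_]*\)".*/\1/p' | head -n 1
}

# Read the JSON from /ws/v1/cluster/metrics on stdin and print the "activeNodes" count.
active_nodes() {
    sed -n 's/.*"activeNodes"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Example usage (replace <hostname> with the ResourceManager host):
#   curl -s "http://<hostname>:8088/ws/v1/cluster/info" | rm_state
#   curl -s "http://<hostname>:8088/ws/v1/cluster/metrics" | active_nodes
```

A state of STARTED together with a non-zero node count suggests that the ResourceManager is up and at least one NodeManager has registered with it.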

Installing and Configuring Streams with Yarn Resource Manager

  1. Install Streams
    Install Streams on a host (host1) that is not part of the Yarn cluster. I used a VM that uses PAM authentication.
    Note: Streams must be installed as the root user, with the Streams installation owner/group set to yarn01/yarn01. In the following steps, the host where Streams was installed is referred to as host1. Modify the steps to reflect your environment.
  2. Set environment variables
    I set the following two environment variables so that I do not have to pass the domain ID and zkconnect string with every streamtool command.

    
        export STREAMS_DOMAIN_ID=<domain-id>
        export STREAMS_ZKCONNECT=<hostn1:port,hostn2:port..>
        source /opt/ibm/InfoSphere_Streams/4.1.0.0/bin/streamsprofile.sh
    
  3. Start controller
    After the installation completes successfully on host1, run the following command as root to start the controller as a system service.
    Note: I used sudo to run as root.

    
        sudo $STREAMS_INSTALL/bin/streamtool registerdomainhost --tags authentication,audit,jmx,sws
    

    Note: The --tags option in the above command is essential for this setup to work correctly.

  4. Check the status of system service
    Check the status of the system service by running the following command:

    
        sudo $STREAMS_INSTALL/bin/streamtool getdomainhoststatus
    
  5. Install Streams on Yarn master node
    Now switch to the Yarn master node (also called the name node). Install Streams as root and specify the Streams installation user/group as yarn01/yarn01.
    Note: Customize this for your environment. I used the same user/group as on host1.
  6. Set Streams environment variables
    After the installation completes successfully, set up the Streams environment by sourcing streamsprofile.sh as follows:

    
        source /opt/ibm/InfoSphere_Streams/4.1.0.0/bin/streamsprofile.sh
    
    
  7. Update streams-am.properties
    Update streams-am.properties to configure the Yarn containers for Streams. If you installed Streams in /opt/ibm/InfoSphere_Streams, the file is in etc/yarn.
    Adjust the file to match the machine configuration. If the system has plenty of memory and cores, the default properties will work. The following is the configuration that I used. This configuration creates each container with a total of 2 GB of memory.

    
        AM_QUEUE_NAME=default 
        AM_CORES=1
        AM_MEMORY=512
        DC_CORES=2
        DC_MEMORY=1024
        WAIT_SYNC_SECS=30
        WAIT_ASYNC_SECS=5
        WAIT_FLEXIBLE_SECS=5
        WAIT_HEARTBEAT_SECS=5
    
  8. Set the environment variables
    
         export STREAMS_DOMAIN_ID=<domain-id>
         export STREAMS_ZKCONNECT=<hostn1:port,hostn2:port..>
    
  9. Start Streams on Yarn
    Start Streams on Yarn using the following command:

    
    streams-on-yarn start --deploy
    

    You should see the following in the terminal:

    
    ….....
    ….....
    Streams App Master launched with Application ID: application_1435239389056_0003
    Waiting for application to start running...........done
    
  10. Check the status of streams-on-yarn
    To check the status of streams-on-yarn you can either use the browser or run a command.

    Check status with browser

    Open this URL in the browser: http://<hostname>:8088/cluster/

    Check status with command line

    
    yarn application -list
    15/08/21 13:51:03 INFO client.RMProxy: Connecting to ResourceManager at xxxxxxxxx
    15/08/21 13:51:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
    Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
    application_1435239389056_0003 Streams-AM-SVT_StreamsDomain_yarn_priya YARN yarn01 default RUNNING UNDEFINED 100% N/A
    
  11. Create a Streams domain.
    Make sure you specify the following properties when creating the domain:

    
     domain.externalResourceManager=yarn 
     domain.serviceStartTimeout=300 
     controller.startTimeout=300
     security.runAsRoot=true
    

    Run the following command to create the domain with these properties set:

    
     streamtool mkdomain --property domain.externalResourceManager=yarn --property domainTrace.defaultLevel=trace --property sws.port=0 --property jmx.port=0 --property domain.serviceStartTimeout=300 --property controller.startTimeout=300 --property security.runAsRoot=true
    
  12. Generate streams authentication keys
    Run the following command:

    
     streamtool genkey
    
    
  13. Start the domain
    Run the following command to start the domain:

    
     streamtool startdomain -v
    
    
  14. Check status of the domain
    Check the status of the domain by running the following command, and verify that the domain services are all placed on host1 and not in a Yarn container.

    
     streamtool getdomainstate -l
    
    
  15. Create a Streams instance.
    I have created an instance with numresources=3 using the following command:

    
     streamtool mkinstance --numresources 3
    
    
  16. Start the instance
    Start the instance and check the status using the following commands:

    
    streamtool startinstance -v
    ...
    streamtool getresourcestate -l
    
  17. Launch Streams console
    To launch the Streams console you need to get the URL using this command:

    
     streamtool geturl
    

    Launch your browser with the URL returned to bring up the Streams console. You can now monitor your domain.
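
The command-line status check from step 10 can also be scripted, for example to wait until the Streams application master is running before creating the domain. This is a rough sketch under a few assumptions: the application name starts with `Streams-AM` and contains no spaces (as in the sample output in step 10), the columns appear in the order shown there, and the function names are hypothetical.

```shell
# Sketch: find the Streams application master in `yarn application -list` output.

# Read the listing on stdin; print "<application-id> <state>" for each
# application whose name starts with Streams-AM.
streams_am_state() {
    awk '$1 ~ /^application_/ && $2 ~ /^Streams-AM/ { print $1, $6 }'
}

# Poll YARN until a Streams application master reports RUNNING.
# $1 is the number of attempts (default 30), with 10 seconds between attempts.
wait_for_streams_am() {
    tries=${1:-30}
    while [ "$tries" -gt 0 ]; do
        if yarn application -list 2>/dev/null | streams_am_state | grep -q ' RUNNING$'; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 10
    done
    return 1
}

# Example usage:
#   yarn application -list 2>/dev/null | streams_am_state
#   wait_for_streams_am 12 || echo "Streams application master is not running"
```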

At this point you have successfully completed the steps to run Streams with Yarn using PAM authentication.

Utility Commands

The following are some useful commands that you may need:

  • Stop Streams domain
    To stop a Streams domain:

    
     streamtool stopdomain
    
    
  • To stop streams-on-yarn
    Issue the following command:

    
     $STREAMS_INSTALL/bin/streams-on-yarn stop
    
    
  • To list hdfs files
    
        $YARN_HOME/bin/hadoop fs -ls /
    
    
  • To remove hdfs files/folders
    
        $YARN_HOME/bin/hadoop fs -rm -r /app*
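
The two stop commands above can be combined into a small shutdown helper. This is a sketch only: the function names `run` and `shutdown_streams` and the `DRY_RUN` switch are hypothetical, and stopping the domain before the Streams application master is an assumed order rather than a documented requirement.

```shell
# Sketch of a shutdown helper (names and the DRY_RUN switch are hypothetical).

# Run a command, or only print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Assumed order: stop the domain first, then the Streams application master on YARN.
shutdown_streams() {
    run streamtool stopdomain || return 1
    run "$STREAMS_INSTALL/bin/streams-on-yarn" stop
}

# Example usage:
#   DRY_RUN=1; shutdown_streams    # print the commands without running them
```

Running it once with `DRY_RUN=1` shows exactly which commands would be executed before anything is stopped.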
    

6 comments on "Running Streams on Yarn using PAM Authentication"

  1. Brian M Williams September 29, 2016

    Why do you install streams in step #1 on a node that is not part of the yarn cluster and then in step 5 you install it on the yarn master node. I do not see where anything is ever done with the installation of Streams performed in Step #1. Can you please clarify.

    • Suvarchala Griddaluru September 29, 2016

      Hi Brian,

      In Step #3, we are starting the controller using the install that is done on a node that is not part of the yarn cluster.
      sudo $STREAMS_INSTALL/bin/streamtool registerdomainhost --tags authentication,audit,jmx,sws

      If you are using PAM authentication and PAM is set up to use a password file, Streams will only be able to authenticate one user: the user that started the Streams processes. If you want to authenticate any other users, you have to configure the Streams controller to run as root, and that is what is done in the above step. So basically the controller on host1 takes care of user authentication.

      There may be other ways to accomplish the same thing. But this is a tested approach that surely worked.
      Hope this helps!

      • Brian M Williams October 02, 2016

        Thanks. The key concept for other readers is that the non-yarn host can be configured before the domain is created. This approach ensures that the authentication service and the two primary domain services that rely on it will be run outside of the yarn-controlled nodes (yarn cluster). This provides a couple of items:
        1) The authentication service will have pam available on host1 and does not need to be on the yarn nodes (users do not have to have accounts on all of the cluster nodes)
        2) The jmx and sws services host will be well-known and easier to find than if they were to be randomly assigned to containers in the yarn cluster
        When the domain is first started, this should be the only host in the domain. As the instances are created, additional hosts (resources) will be added for use by the instance from the yarn containers.

        Good article, thanks for the clarification.

      • Brian M Williams October 02, 2016

        Another question: Is the yarn01 user you mention the user that was used to install yarn as well as run the IBM Streams domain? Does this user's home directory need to be shared across the cluster? (e.g. ~/.streams directory available and accessed)

        • Suvarchala Griddaluru October 05, 2016

          Hi Brian,
          1) That’s correct. yarn01 user is used to install yarn and run streams domain.
          2) yarn01 user’s home directory does not have to be on a shared filesystem. In step #9 above streams-on-yarn is started with the --deploy option (streams-on-yarn start --deploy). The “--deploy” option takes care of provisioning streams on all cluster nodes.

          Hope this helps!
