There are two parts to setting up and running Streams on Yarn using PAM authentication: installing Yarn and then installing Streams and configuring it to use Yarn. All the steps are described in this article. Setup instructions are the same for Streams 4.0.1 and Streams 4.1.
Installing and configuring Yarn
- Install Yarn
Download the latest version of Hadoop from the following site and extract the compressed tar file to a folder, for example ~/YARN. Testing was done with hadoop-2.7.0; newer releases may work as well. Extracting the downloaded file creates a folder called hadoop-2.7.0 under ~/YARN. I did this setup as the yarn01 user.
- Set environment variables
Add the following environment variables to ~/.bashrc. Customize these environment variables based on your settings.
export JAVA_HOME=/opt/ibm/java-x86_64-71
export HADOOP_HOME=<absolute path to hadoop-2.7.0>
export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export YARN_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export PATH=$YARN_HOME/bin:$PATH
export PATH=$JAVA_HOME/bin:$PATH
Run the following command to set the environment variables:
source ~/.bashrc
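Before moving on, it can help to confirm that the variables really are set in the current shell. The following is a minimal sketch, not part of the original setup; the function name check_hadoop_env is hypothetical, and it only tests that each variable is non-empty, not that the paths exist.

```shell
# Hypothetical helper: print every variable from ~/.bashrc that is unset or empty.
# Returns non-zero if at least one variable is missing.
# The body runs in a subshell ( ... ) so its loop variables do not leak.
check_hadoop_env() (
  missing=0
  for name in JAVA_HOME HADOOP_HOME HADOOP_PREFIX HADOOP_CONF_DIR \
              YARN_HOME HADOOP_YARN_HOME HADOOP_COMMON_LIB_NATIVE_DIR; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
)

# Usage:
#   check_hadoop_env || echo "set the variables listed above, then source ~/.bashrc again"
```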
- Edit yarn-site.xml
You need to edit $YARN_HOME/etc/hadoop/yarn-site.xml to change the following properties:
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.application.classpath</name>
    <value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*</value>
  </property>
</configuration>
- Edit hdfs-site.xml
You need to add the HDFS directories that will be used by YARN. Update the values based on your settings.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/yarn/hdfs/hadoop-2.7.0/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/yarn/hdfs/hadoop-2.7.0/datanode</value>
  </property>
</configuration>
- Set up ssh keys
SSH keys should be set up between all the hosts that are part of the YARN cluster so that the user can log in to every host without being prompted for a passphrase.
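The article does not list the commands for this step. One common way to do it with OpenSSH is sketched below, under the assumption that the standard ssh-keygen and ssh-copy-id tools are available. The function names, the key path and the host names are hypothetical. The first function only prints the commands so they can be reviewed before running them.

```shell
# Hypothetical helper: print (do not run) the OpenSSH commands typically used to
# set up key-based login. $1 is the user; the remaining arguments are the hosts.
print_ssh_setup() (
  user=$1
  shift
  # -N '' creates the key with an empty passphrase.
  printf '%s\n' "ssh-keygen -t rsa -N '' -f \$HOME/.ssh/id_rsa"
  for host in "$@"; do
    printf '%s\n' "ssh-copy-id $user@$host"
  done
)

# Hypothetical helper: succeed only if login as user $1 on host $2 works with no prompt.
# BatchMode=yes makes ssh fail instead of asking for a password or passphrase.
can_login_without_prompt() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1@$2" true
}

# Usage (host names are placeholders):
#   print_ssh_setup yarn01 host2 host3
#   can_login_without_prompt yarn01 host2 && echo "host2 ok"
```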
- Format the file system
Format the file system before starting NameNode and DataNode daemons using the following command:
$YARN_HOME/bin/hdfs namenode -format
Note: The YARN logs are available at $YARN_HOME/logs.
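Formatting erases the NameNode metadata, so it can be worth guarding the command when the steps are rerun. The sketch below relies on the fact that a formatted NameNode directory contains a current/VERSION file. The function names are hypothetical, and the directory is whatever you set for dfs.namenode.name.dir in hdfs-site.xml.

```shell
# Hypothetical helper: turn a dfs.namenode.name.dir value such as
# file:/yarn/hdfs/hadoop-2.7.0/namenode into a plain filesystem path.
name_dir_to_path() {
  printf '%s\n' "${1#file:}"
}

# Hypothetical helper: succeed if the NameNode directory has already been formatted.
namenode_is_formatted() {
  [ -f "$(name_dir_to_path "$1")/current/VERSION" ]
}

# Usage (the value is the example from hdfs-site.xml):
#   if namenode_is_formatted file:/yarn/hdfs/hadoop-2.7.0/namenode; then
#     echo "NameNode already formatted, skipping"
#   else
#     "$YARN_HOME/bin/hdfs" namenode -format
#   fi
```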
- Start yarn
The following commands will start Yarn:
$YARN_HOME/sbin/hadoop-daemon.sh start namenode
$YARN_HOME/sbin/hadoop-daemons.sh start datanode
$YARN_HOME/sbin/yarn-daemon.sh start resourcemanager
$YARN_HOME/sbin/yarn-daemons.sh start nodemanager
Note: The YARN cluster can be monitored at the following URL: http://<hostname>:8088/cluster/
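Besides the web page, the daemons can be checked from the command line with jps, the JDK tool that lists running Java processes. The sketch below is not part of the original article and the function name is hypothetical. It assumes a single-node setup where all four daemons run on the same host. On a multi-host cluster, DataNode and NodeManager appear on the worker hosts instead.

```shell
# Hypothetical helper: read `jps` output ("<pid> <class name>" per line) on stdin
# and print each expected Hadoop daemon that is not listed.
missing_daemons() (
  listing="$(cat)"
  for daemon in NameNode DataNode ResourceManager NodeManager; do
    # Exact match on the second column, so SecondaryNameNode does not count as NameNode.
    printf '%s\n' "$listing" |
      awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
      echo "$daemon"
  done
)

# Usage:
#   jps | missing_daemons
# The ResourceManager also serves a REST endpoint that reports the cluster state:
#   curl -s "http://<hostname>:8088/ws/v1/cluster/info"
```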
Installing and Configuring Streams with Yarn Resource Manager
- Install Streams
Install Streams on a host (host1) that is not part of the YARN cluster. I used a VM that uses PAM authentication.
Note: The Streams installation should be done as the root user, with the installation owner/group set to yarn01/yarn01. In the following steps, the host where Streams was installed is referred to as host1. Please modify the steps to reflect your environment.
- Set environment variables
I set the following two environment variables so that I do not have to pass domain-id and zkconnect with every streamtool command.
export STREAMS_DOMAIN_ID=<domain-id>
export STREAMS_ZKCONNECT=<hostn1:port,hostn2:port..>
source /opt/ibm/InfoSphere_Streams/126.96.36.199/bin/streamsprofile.sh
- Start controller
After the installation on host1 completes successfully, run the following command as root to start the controller as a system service.
Note: I used sudo to run as root.
sudo $STREAMS_INSTALL/bin/streamtool registerdomainhost --tags authentication,audit,jmx,sws
Note: The --tags option in the above command is required for this setup to work correctly.
- Check the status of system service
Check the status of the system service by running the following command:
sudo $STREAMS_INSTALL/bin/streamtool getdomainhoststatus
- Install Streams on Yarn master node
Now switch to the YARN master node (also called the name node). Install Streams as root and specify yarn01/yarn01 as the Streams installation user/group.
Note: Please customize this based on your environment. I used the same user/group as on host1.
- Set Streams environment variables
After a successful installation, set up the Streams environment by sourcing streamsprofile.sh, in the same way as on host1.
- Update streams-am.properties
Update streams-am.properties to configure the YARN containers for Streams. If you installed Streams in /opt/ibm/InfoSphere_Streams, the file is in etc/yarn.
Update the streams-am.properties file based on the machine configuration. If the system has plenty of memory and cores, the default properties will work. The following is the configuration that I used. With this configuration, each container is created with a total of 2 GB of memory.
AM_QUEUE_NAME=default
AM_CORES=1
AM_MEMORY=512
DC_CORES=2
DC_MEMORY=1024
WAIT_SYNC_SECS=30
WAIT_ASYNC_SECS=5
WAIT_FLEXIBLE_SECS=5
WAIT_HEARTBEAT_SECS=5
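To double-check what is configured before starting Streams on YARN, the values can be read back from the file. The sketch below assumes the file uses plain KEY=VALUE lines as shown above. The function name get_prop is hypothetical, and the file path is passed as an argument rather than assumed.

```shell
# Hypothetical helper: print the value of key $2 from the properties file $1.
# Expects KEY=VALUE lines; lines starting with # never match a key.
get_prop() {
  awk -F= -v key="$2" '$1 == key { sub(/^[^=]*=/, ""); print; exit }' "$1"
}

# Usage (replace the path with the location of your streams-am.properties):
#   get_prop /path/to/streams-am.properties AM_MEMORY
#   get_prop /path/to/streams-am.properties DC_MEMORY
```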
- Set the environment variables
export STREAMS_DOMAIN_ID=<domain-id>
export STREAMS_ZKCONNECT=<hostn1:port,hostn2:port..>
- Start Streams on Yarn
Start Streams on Yarn using the following command:
streams-on-yarn start --deploy
You should see the following in the terminal:
...
...
Streams App Master launched with Application ID: application_1435239389056_0003
Waiting for application to start running...........done
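The application ID in that output is useful for later checks, so it can be convenient to capture it. The sketch below assumes the "Application ID: application_..." wording shown in the sample output. The function name extract_app_id and the log file name are hypothetical.

```shell
# Hypothetical helper: read the start-up output on stdin and print the YARN
# application ID found after "Application ID: ".
extract_app_id() {
  sed -n 's/.*Application ID: \(application_[0-9][0-9_]*\).*/\1/p'
}

# Usage (start.log is a placeholder file name):
#   streams-on-yarn start --deploy | tee start.log
#   app_id="$(extract_app_id < start.log)"
#   yarn application -status "$app_id"
```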
- Check the status of streams-on-yarn
To check the status of streams-on-yarn you can either use the browser or run a command.
Check status with browser
Open this URL in the browser: http://<hostname>:8088/cluster/
Check status with command line
yarn application -list
15/08/21 13:51:03 INFO client.RMProxy: Connecting to ResourceManager at xxxxxxxxx
15/08/21 13:51:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1435239389056_0003 Streams-AM-SVT_StreamsDomain_yarn_priya YARN yarn01 default RUNNING UNDEFINED 100% N/A
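When this check is scripted, only the state column matters. The sketch below filters the listing for the Streams application master. It assumes the column order shown in the sample output, an application name that starts with "Streams-AM-" as in that sample, and a name without spaces. The function name is hypothetical.

```shell
# Hypothetical helper: read `yarn application -list` output on stdin and print
# "<application-id> <state>" for each Streams application master.
streams_am_state() {
  # Column 1 is the application ID, column 2 the name, column 6 the state.
  awk '$1 ~ /^application_/ && $2 ~ /^Streams-AM-/ { print $1, $6 }'
}

# Usage:
#   yarn application -list 2>/dev/null | streams_am_state
```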
- Create a Streams domain.
Make sure you specify the following properties when creating the domain:
domain.externalResourceManager=yarn
domain.serviceStartTimeout=300
controller.startTimeout=300
security.runAsRoot=true
Run the following command to set the properties:
streamtool mkdomain --property domain.externalResourceManager=yarn --property domainTrace.defaultLevel=trace --property sws.port=0 --property jmx.port=0 --property domain.serviceStartTimeout=300 --property controller.startTimeout=300 --property security.runAsRoot=true
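Because the command takes one --property option per setting, it can be easier to keep the properties in a list and generate the command line from it. The sketch below only prints the command so it can be reviewed first. The function name mkdomain_command is hypothetical, and the properties are the ones used above.

```shell
# Hypothetical helper: read "key=value" lines on stdin and print the matching
# streamtool mkdomain command with one --property option per line.
mkdomain_command() (
  printf 'streamtool mkdomain'
  while IFS= read -r prop; do
    # Skip empty lines so a trailing blank line adds no empty option.
    if [ -n "$prop" ]; then
      printf ' --property %s' "$prop"
    fi
  done
  printf '\n'
)

# Usage:
#   mkdomain_command <<'EOF'
#   domain.externalResourceManager=yarn
#   domain.serviceStartTimeout=300
#   controller.startTimeout=300
#   security.runAsRoot=true
#   EOF
```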
- Generate streams authentication keys
Run the following command:
- Start the domain
Run the following command to start the domain:
streamtool startdomain -v
- Check status of the domain
Check the status of the domain by running the following command, and verify that the domain services are all placed on host1 and not on a YARN container.
streamtool getdomainstate -l
- Create a Streams instance.
I have created an instance with numresources=3 using the following command:
streamtool mkinstance --numresources 3
- Start the instance
Start the instance and check the status using the following commands:
streamtool startinstance -v
...
streamtool getresourcestate -l
- Launch Streams console
To launch the Streams console you need to get the URL using this command:
Launch your browser with the URL returned to bring up the Streams console. You can now monitor your domain.
At this point you have successfully completed the steps to run Streams with Yarn using PAM authentication.
The following are some useful commands that you may need:
- Stop Streams domain
To stop a Streams domain:
- To stop streams-on-yarn
Issue the following command:
- To list hdfs files
$YARN_HOME/bin/hadoop fs -ls /
- To remove hdfs files/folders
$YARN_HOME/bin/hadoop fs -rm -r /app*