The Spark distribution in IBM® Open Platform with Apache Hadoop is built with YARN support. This means that, in addition to the default mode of running Spark applications locally, there are two additional deploy modes that can be used to launch Spark applications on YARN. In this post, we demonstrate the differences among these run modes.

The easiest way to run your Spark application is to use the spark-submit program usually found in:

 /usr/bin/spark-submit

To launch a Spark application, you run:

 $ spark-submit --class path.to.your.Class [--master mode] [options] [app options]

For example:

 
$ spark-submit --name TPCDSHiveOnSpark \
    --class org.apache.spark.examples.sql.hive.TPCDSHiveOnSpark \
    --master yarn-client \
    --num-executors 100 \
    --executor-memory 4096m \
    /TestAutomation/sbt/target/scala-2.10/tpcdshiveonspark_2.10-1.2.1.jar \
    tpcds1000g \
    /TestAutomation/sbt/src/main/hive-queries/query91.sql

Local mode (default)

By default, there is no ‘slaves’ file under the ‘spark/conf’ directory, so the launch scripts default to a single machine (localhost), which is useful for testing and demos. If you submit a Spark job via the spark-submit command without the --master parameter, the job runs locally on the driver host machine, for example:

 
$ spark-submit --name TPCDSHiveOnSpark \
    --class org.apache.spark.examples.sql.hive.TPCDSHiveOnSpark \
    /TestAutomation/sbt/target/scala-2.10/tpcdshiveonspark_2.10-1.2.1.jar \
    tpcds1000g \
    /TestAutomation/sbt/src/main/hive-queries/query91.sql
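
You can also make local execution explicit, and control the number of worker threads, by passing --master local[N] (or local[*] to use all available cores). For example, to run the same job with four threads:

$ spark-submit --master local[4] --name TPCDSHiveOnSpark \
    --class org.apache.spark.examples.sql.hive.TPCDSHiveOnSpark \
    /TestAutomation/sbt/target/scala-2.10/tpcdshiveonspark_2.10-1.2.1.jar \
    tpcds1000g \
    /TestAutomation/sbt/src/main/hive-queries/query91.sql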

Local mode is limited to localhost’s memory and cores, and HDFS blocks that reside on other data nodes must be copied over the network. The most commonly seen error is ‘java.lang.OutOfMemoryError’, caused by processing a large dataset with too small a heap:

java.lang.OutOfMemoryError: Java heap space
  15/04/28 16:11:48 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-17,5,main]
  java.lang.OutOfMemoryError: Java heap space
  15/04/28 16:11:48 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-21,5,main]
  java.lang.OutOfMemoryError: Java heap space
  15/04/28 16:11:48 INFO spark.ContextCleaner: Cleaned broadcast 1
  15/04/28 16:11:48 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-8,5,main]
  java.lang.OutOfMemoryError: Java heap space
          at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
          at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
          at org.apache.spark.sql.columnar.BasicColumnBuilder.initialize(ColumnBuilder.scala:67)
          at org.apache.spark.sql.columnar.ComplexColumnBuilder.org$apache$spark$sql$columnar$NullableColumnBuilder$$super$initialize(ColumnBuilder.scala:81)
          at org.apache.spark.sql.columnar.NullableColumnBuilder$class.initialize(NullableColumnBuilder.scala:52)
          at org.apache.spark.sql.columnar.ComplexColumnBuilder.initialize(ColumnBuilder.scala:81)
          at org.apache.spark.sql.columnar.ColumnBuilder$.apply(ColumnBuilder.scala:162)
          at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$4.apply(InMemoryColumnarTableScan.scala:117)
          at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$4.apply(InMemoryColumnarTableScan.scala:114)
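
Because the executor runs inside the driver JVM in local mode, one simple mitigation is to give the driver a larger heap with --driver-memory (the 8g value below is only illustrative; size it to your data and available RAM):

$ spark-submit --driver-memory 8g --name TPCDSHiveOnSpark \
    --class org.apache.spark.examples.sql.hive.TPCDSHiveOnSpark \
    /TestAutomation/sbt/target/scala-2.10/tpcdshiveonspark_2.10-1.2.1.jar \
    tpcds1000g \
    /TestAutomation/sbt/src/main/hive-queries/query91.sql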

YARN client mode

In this mode, all Spark executors run as YARN containers, but the driver does not run inside the Application Master; it runs as a stand-alone JVM on the client machine. You can therefore kill the driver application (and with it the workload) the same way you would any other program (e.g., Ctrl-C or ‘kill -9 <pid>’). The command is as follows:

$ spark-submit  --master yarn-client --name TPCDSHiveOnSpark \
    --executor-memory 4096m \
    --num-executors 100 \
    --class org.apache.spark.examples.sql.hive.TPCDSHiveOnSpark \
    /TestAutomation/sbt/target/scala-2.10/tpcdshiveonspark_2.10-1.2.1.jar \
    tpcds1000g \
    /TestAutomation/sbt/src/main/hive-queries/query91.sql

The above command creates 100 executors of 4 GB each, yielding a total of about 400 GB of memory (some of each heap is reserved as overhead, so you actually get somewhat less than 400 GB); runs the class org.apache.spark.examples.sql.hive.TPCDSHiveOnSpark packaged in scala-2.10/tpcdshiveonspark_2.10-1.2.1.jar; passes the name TPCDSHiveOnSpark to YARN so that it shows up in the YARN job list; and takes the program parameters tpcds1000g and a query defined in the file /TestAutomation/sbt/src/main/hive-queries/query91.sql.
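
Once the job is submitted, you can verify that it is registered with YARN under the name you passed via --name (the -appStates filter is optional):

$ yarn application -list -appStates RUNNING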

The benefits of running Spark applications in ‘yarn-client’ mode are mainly that 1) you do not need to copy your driver files (jar, shell scripts, data files, etc.) to all data nodes, and 2) you control which host runs the driver JVM (e.g., your Windows desktop).

For some developers, the down side of running in ‘yarn-client’ mode is that your application output is printed to the driver’s console intermingled with Spark engine output, making it difficult to pick out your results at the end. For example:

15/04/08 22:51:03 INFO scheduler.TaskSetManager: Finished task 178.0 in stage 4.0 (TID 3503) in 158 ms on bigaperf133.svl.ibm.com (198/200)
15/04/08 22:51:03 INFO scheduler.TaskSetManager: Finished task 195.0 in stage 4.0 (TID 3520) in 94 ms on bigaperf135.svl.ibm.com (199/200)
15/04/08 22:51:03 INFO scheduler.TaskSetManager: Finished task 154.0 in stage 4.0 (TID 3479) in 182 ms on bigaperf133.svl.ibm.com (200/200)
15/04/08 22:51:03 INFO scheduler.DAGScheduler: Stage 4 (takeOrdered at basicOperators.scala:183) finished in 0.449 s
15/04/08 22:51:03 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 4.0, whose tasks have all completed, from pool
15/04/08 22:51:03 INFO scheduler.DAGScheduler: Job 1 finished: takeOrdered at basicOperators.scala:183, took 202.240924 s
[2000,10007009,brandunivamalg #9,1230939.160000001]
[2000,9002009,importomaxi #9,1220674.8400000012]
[2000,2004001,importoimporto #2,1218846.020000001]
[2000,7012006,importonameless #6,1217073.6200000003]
/TestAutomation/sbt/src/main/hive-queries/query03.sql completed
query time: 207.379336167 sec [program output]
15/04/08 22:51:03 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs/json,null}
15/04/08 22:51:03 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs,null}
15/04/08 22:51:03 INFO ui.SparkUI: Stopped Spark web UI at http://bigaperf132.svl.ibm.com:4041
15/04/08 22:51:03 INFO scheduler.DAGScheduler: Stopping DAGScheduler
15/04/08 22:51:03 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors
15/04/08 22:51:03 INFO cluster.YarnClientSchedulerBackend: Asking each executor to shut down
15/04/08 22:51:03 INFO cluster.YarnClientSchedulerBackend: Stopped
15/04/08 22:51:04 INFO spark.MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
15/04/08 22:51:04 INFO storage.MemoryStore: MemoryStore cleared
15/04/08 22:51:04 INFO storage.BlockManager: BlockManager stopped
15/04/08 22:51:04 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
15/04/08 22:51:04 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/04/08 22:51:04 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/04/08 22:51:04 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
15/04/08 22:51:04 INFO spark.SparkContext: Successfully stopped SparkContext [Spark run-time output]
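
If the verbosity bothers you, one common workaround is to lower Spark’s console log level so that mostly your application output (plus warnings and errors) reaches the console. A minimal sketch, assuming the stock log4j template shipped under the Spark conf directory:

$ cd /usr/iop/4.0.0.0/spark/conf
$ cp log4j.properties.template log4j.properties
$ # then edit log4j.properties and change
$ #     log4j.rootCategory=INFO, console
$ # to
$ #     log4j.rootCategory=WARN, console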

YARN cluster mode

This mode runs your Spark application like any other standard YARN job: an Application Master is created to host your driver application, in addition to the containers created for Spark executors. YARN may place the driver on any data node in the cluster. The command is as follows:

$ spark-submit  --master yarn-cluster \
    --name TPCDSHiveOnSpark \
    --files /usr/iop/4.0.0.0/spark/conf/hive-site.xml  \
    --executor-memory 4096m \
    --num-executors 100 \
    --jars \
        /usr/iop/4.0.0.0/spark/lib/datanucleus-core-3.2.10.jar,\
        /usr/iop/4.0.0.0/spark/lib/datanucleus-api-jdo-3.2.6.jar,\
        /usr/iop/4.0.0.0/spark/lib/datanucleus-rdbms-3.2.9.jar \
    --class org.apache.spark.examples.sql.hive.TPCDSHiveOnSpark \
    /TestAutomation/sbt/target/scala-2.10/tpcdshiveonspark_2.10-1.2.1.jar \
    tpcds1000g \
    /TestAutomation/sbt/src/main/hive-queries/query91.sql

The command above adds the --files and --jars parameters: the location of hive-site.xml is passed via --files, and the datanucleus jars (on which Spark SQL depends) are passed via --jars. Note that the query file, /TestAutomation/sbt/src/main/hive-queries/query91.sql, must exist on every node in the cluster, because YARN may launch the driver on any node.
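
Alternatively, you can let YARN distribute the query file for you by adding it to the --files list; files shipped this way are localized into each container’s working directory, including the driver’s. This is a sketch, and it assumes the example program opens the query path relative to its working directory:

$ spark-submit  --master yarn-cluster \
    --name TPCDSHiveOnSpark \
    --files /usr/iop/4.0.0.0/spark/conf/hive-site.xml,/TestAutomation/sbt/src/main/hive-queries/query91.sql \
    --executor-memory 4096m \
    --num-executors 100 \
    --jars \
        /usr/iop/4.0.0.0/spark/lib/datanucleus-core-3.2.10.jar,\
        /usr/iop/4.0.0.0/spark/lib/datanucleus-api-jdo-3.2.6.jar,\
        /usr/iop/4.0.0.0/spark/lib/datanucleus-rdbms-3.2.9.jar \
    --class org.apache.spark.examples.sql.hive.TPCDSHiveOnSpark \
    /TestAutomation/sbt/target/scala-2.10/tpcdshiveonspark_2.10-1.2.1.jar \
    tpcds1000g \
    query91.sql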

The benefits of running in ‘yarn-cluster’ mode include 1) that YARN creates and manages all run-time components, including the driver application, and 2) that your application output is written to the YARN job’s stdout log, making post-processing easy.
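
After the application finishes, you can retrieve its aggregated logs, including the driver’s stdout, with the yarn CLI (this assumes log aggregation is enabled on the cluster; substitute your own application ID):

$ yarn logs -applicationId <application ID>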

On the other hand, running in ‘yarn-cluster’ mode requires 1) that you pass dependent Spark jars to the ‘spark-submit’ command, 2) that you copy and synchronize all driver files (jar, shell scripts, data files) onto all data nodes, since that is the only way for YARN to find them, and 3) that you give up direct control over your driver application.
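
Giving up control means, for example, that Ctrl-C in your terminal no longer stops the workload; you stop the driver through YARN instead (substitute your own application ID):

$ yarn application -kill <application ID>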

A common error seen when you forget to synchronize driver files is the following:

ERROR for yarn-cluster mode if you don't have accessible driver files:
15/04/17 14:39:09 INFO yarn.Client:
         client token: N/A
         diagnostics: Application application_1428338239323_0128 failed 2 times due to AM Container for appattempt_1428338239323_0128_000002 exited with  exitCode: 15
For more detailed output, check application tracking page:http://bigaperf133.svl.ibm.com:8088/proxy/application_1428338239323_0128/Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.

Happy building and running your Spark application on IBM® Open Platform with Apache Hadoop!

To get started, download the IBM® Open Platform with Apache Hadoop.
Visit the Knowledge Center for installation and configuration information.
