Apache Spark is a fast, general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing. Running Spark on a Hadoop YARN-based architecture lets Spark share a common cluster and data set with other Hadoop workloads.

Apache Oozie is a workflow scheduler that is used to manage Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work as a directed acyclic graph (DAG) of actions. Oozie is reliable, scalable, extensible, and well integrated with the Hadoop stack, with YARN as its architectural center. It supports several types of Hadoop jobs out of the box, such as Java MapReduce, Pig, Hive, Sqoop, SSH, and DistCp, as well as system-specific jobs, such as Java programs and shell scripts.

You can use the high-level Spark APIs in Java, Scala, Python, and R to develop Spark applications in the big data platform, and then use Oozie to schedule Spark jobs.

This article is part 1 of a series that shows you how to use Oozie to schedule various Spark applications (written in Python, SparkR, SystemML, Scala, and SparkSQL) on YARN. Part 1 focuses on PySpark and SparkR with Oozie.

Oozie Spark action overview

The Oozie Spark action runs a Spark job: a Spark application written in Python, SparkR, SystemML, Scala, or SparkSQL, among others. Like other Oozie actions, the Spark action is defined by a workflow.xml file and a job.properties file. To run a Spark job with the Oozie Spark action, set the job-tracker, name-node, and Spark master values in the job.properties file, and then configure the required jar, spark-opts, and arg elements in the workflow.xml file.

Here is the syntax of the spark action workflow.xml file:

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.3">
    ...
    <action name="[NODE-NAME]">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
               <delete path="[PATH]"/>
               ...
               <mkdir path="[PATH]"/>
               ...
            </prepare>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <master>[SPARK MASTER URL]</master>
            <mode>[SPARK MODE]</mode>
            <name>[SPARK JOB NAME]</name>
            <class>[SPARK MAIN CLASS]</class>
            <jar>[SPARK DEPENDENCIES JAR / PYTHON FILE]</jar>
            <spark-opts>[SPARK-OPTIONS]</spark-opts>
            <arg>[ARG-VALUE]</arg>
                ...
            <arg>[ARG-VALUE]</arg>
            ...
        </spark>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>

Some of these elements in Oozie workflow.xml are defined as follows:

  • The prepare element specifies a list of paths to create or delete before starting the job. The paths must start with: hdfs://host_name:port_number
  • The master element specifies the URL of the Spark master; for example, spark://host:port, mesos://host:port, yarn-cluster, yarn-client, or local. For Spark on YARN, specify yarn-client or yarn-cluster in the master element. In this example, master=yarn-cluster.
  • The name element specifies the name of the Spark application.
  • The jar element specifies a comma-separated list of JAR or Python files.
  • The spark-opts element, if present, contains a list of Spark configuration options that can be passed to the Spark driver by specifying ‘--conf key=value’.
  • The arg element contains arguments that can be passed to the Spark application.

For detailed information about the Spark XML schema in Oozie, see https://oozie.apache.org/docs/4.3.0/DG_SparkActionExtension.html.
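
To relate these elements to a plain spark-submit invocation: master maps to --master, name to --name, spark-opts to extra options such as --conf key=value, jar to the application JAR or Python file, and each arg to a trailing application argument. The following sketch is only illustrative (every value is a placeholder, and Oozie's SparkMain class assembles the real command line for you):

$ spark-submit --master yarn-cluster --name MySparkApp \
    --conf spark.executor.memory=2G \
    hdfs://nn:8020/user/ambari-qa/app/myapp.py input-path output-path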

Scheduling a PySpark program on YARN with Oozie

Consider a simple word count application that is written in the Spark Python API. The following steps show you how to schedule and launch this PySpark job on YARN with Oozie. A full program listing appears at the end of the section.

First, here are some notes about prerequisites when you are running PySpark with yarn-cluster mode on a multi-node cluster:

  • When a Spark job is submitted, the Spark code checks for the PYSPARK_ARCHIVES_PATH environment variable. If PYSPARK_ARCHIVES_PATH cannot be found, Spark looks for SPARK_HOME. You can set PYSPARK_ARCHIVES_PATH by using the oozie.launcher.yarn.app.mapreduce.am.env property.
  • The py4j-0.10.4-src.zip and pyspark.zip files (versions might vary depending on the Spark version) are necessary to run a Python script in Spark. Therefore, both files must be present in the classpath while the script is running. Simply put them under the lib/ directory for the workflow (a quick way to locate them is shown after this list).
  • The --py-files option must be configured and passed in the <spark-opts> element.
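
A quick way to confirm where these two archives live on a node that has the Spark client installed is shown below; the directory is the IOP layout used later in this article, so adjust it for your distribution:

$ ls /usr/iop/current/spark-client/python/lib/
py4j-0.10.4-src.zip  pyspark.zip
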
  1. Create a workflow definition (workflow.xml). The following simple workflow definition executes one Spark job:
    <workflow-app xmlns='uri:oozie:workflow:0.5' name='PySpark'>
        <global>
            <configuration>
                <property>
                    <name>oozie.launcher.yarn.app.mapreduce.am.env</name>
                    <value>PYSPARK_ARCHIVES_PATH=pyspark.zip</value>
                </property>
            </configuration>
        </global>
        <start to='spark-node' />
        <action name='spark-node'>
            <spark xmlns="uri:oozie:spark-action:0.1">
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <prepare>
                    <delete path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data"/>
                </prepare>
                <master>${master}</master>
                <name>MyApp</name>
                <jar>${nameNode}/user/${wf:user()}/${examplesRoot}/MyPython.py</jar>
                <spark-opts>--conf spark.driver.extraJavaOptions=-Diop.version=4.2.5.0 --conf spark.yarn.archive=hdfs://nn:8020/iop/apps/4.2.5.0-0000/spark2/spark2-iop-yarn-archive.tar.gz --py-files pyspark.zip,py4j-0.10.4-src.zip</spark-opts>
                <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/input-data</arg>
                <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/output-data</arg>
            </spark>
            <ok to="end" />
            <error to="fail" />
        </action>
        <kill name="fail">
            <message>Workflow failed, error
                message[${wf:errorMessage(wf:lastErrorNode())}]
            </message>
        </kill>
        <end name='end' />
    </workflow-app>
    
  2. Create an Oozie job configuration (job.properties).
    nameNode=hdfs://nn:8020
    jobTracker=rm:8050
    master=yarn-cluster
    queueName=default
    examplesRoot=spark-example
    oozie.use.system.libpath=true
    oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}
    
  3. Create an Oozie application directory. Create an application directory structure with the workflow definition and resources, as shown in the following example:
    +-~/spark-example/
      +-job.properties
      +-workflow.xml
      +-MyPython.py
      +-WordCount.txt
      +-lib
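    For example, the layout can be assembled locally before it is uploaded (the source file locations are illustrative; the lib/ directory is populated in step 5):
    $ mkdir -p ~/spark-example/lib
    $ cp workflow.xml job.properties MyPython.py WordCount.txt ~/spark-example/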
    
  4. Copy the application to the HDFS. Copy the spark-example/ directory to the user HOME directory in the HDFS. Ensure that the spark-example location in the HDFS matches the value of oozie.wf.application.path in job.properties.
    $ hadoop fs -put spark-example /user/ambari-qa/
    
  5. Copy py4j-0.10.4-src.zip and pyspark.zip to the HDFS.
    $ hadoop fs -put /usr/iop/current/spark-client/python/lib/pyspark.zip /user/ambari-qa/spark-example/lib
    $ hadoop fs -put /usr/iop/current/spark-client/python/lib/py4j-0.10.4-src.zip /user/ambari-qa/spark-example/lib
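    Optionally, verify that the application files and the two archives landed where the workflow and job.properties expect them:
    $ hadoop fs -ls -R /user/ambari-qa/spark-example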
    
  6. Run the example job.
    1. Submit the Oozie job by running the following command:
      $ cd ~/spark-example
      
      $ oozie job -oozie http://oozie-host:11000/oozie -config ./job.properties -run
      job: 0000031-161115185001062-oozie-oozi-W
      
    2. Check the workflow job status:
      $ oozie job -oozie http://oozie-host:11000/oozie -info 0000031-161115185001062-oozie-oozi-W
      
      Job ID : 0000031-161115185001062-oozie-oozi-W
      ------------------------------------------------------------------------------------------------------------------------------------
      Workflow Name : PySpark
      App Path      : hdfs://oozie-host:8020/user/ambari-qa/spark-example
      Status        : SUCCEEDED
      Run           : 0
      User          : ambari-qa
      Group         : -
      Created       : 2016-11-16 08:21 GMT
      Started       : 2016-11-16 08:21 GMT
      Last Modified : 2016-11-16 08:22 GMT
      Ended         : 2016-11-16 08:22 GMT
      CoordAction ID: -
      
      Actions
      ------------------------------------------------------------------------------------------------------------------------------------
      ID                                                                            Status    Ext ID                 Ext Status Err Code  
      ------------------------------------------------------------------------------------------------------------------------------------
      0000031-161115185001062-oozie-oozi-W@:start:                                  OK        -                      OK         -         
      ------------------------------------------------------------------------------------------------------------------------------------
      0000031-161115185001062-oozie-oozi-W@spark-node                               OK        job_1479264601071_0068 SUCCEEDED  -         
      ------------------------------------------------------------------------------------------------------------------------------------
      0000031-161115185001062-oozie-oozi-W@end                                      OK        -                      OK         -         
      ------------------------------------------------------------------------------------------------------------------------------------	
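      If the spark-node action fails, the Oozie launcher log and the YARN application log usually show the underlying error. For example (the application ID here corresponds to the Ext ID above, with the job_ prefix replaced by application_):
      $ oozie job -oozie http://oozie-host:11000/oozie -log 0000031-161115185001062-oozie-oozi-W
      $ yarn logs -applicationId application_1479264601071_0068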
      

The full PySpark program

from pyspark import SparkConf, SparkContext
from operator import add

def main():
    conf = SparkConf().setAppName("MyApp")
    sc = SparkContext(conf=conf)
    # Read the input file that was uploaded with the application directory.
    # (This simple example hard-codes its paths and does not use the <arg>
    # values passed by the workflow.)
    lines = sc.textFile("/user/ambari-qa/spark-example/WordCount.txt")
    words = lines.flatMap(lambda line: line.split(' '))
    wc = words.map(lambda x: (x, 1))
    counts = wc.reduceByKey(add)
    # Save the word counts to a directory relative to the user's HDFS home
    counts.saveAsTextFile("wcres12")

if __name__ == '__main__':
    main()
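
To test the same script outside of Oozie, it can be submitted directly with spark-submit. The following invocation is only a sketch; --master yarn --deploy-mode cluster is the Spark 2 equivalent of the yarn-cluster master used above:

$ spark-submit --master yarn --deploy-mode cluster MyPython.py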

Scheduling a SparkR program on YARN with Oozie

Consider an example DataFrame application that is written in the SparkR API. The following steps show you how to schedule and launch this SparkR job on YARN with Oozie. A full program listing appears at the end of the section.

Keep in mind the following prerequisites when you are running SparkR with yarn-cluster mode on a multi-node cluster:

  • Set the SPARK_HOME environment variable by using the oozie.launcher.yarn.app.mapreduce.am.env property in the Oozie workflow.xml file, as shown in the workflow definition below.
  • The dataframe.R script must be specified in the <jar> element (a quick check of the SparkR installation is shown after this list).
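
To confirm that the SparkR package is available under the SPARK_HOME that the workflow sets, you can list the R library directory. The path below matches the value used in workflow.xml, and a standard Spark 2 directory layout is assumed:

$ ls /usr/iop/4.2.5.0-0000/spark2/R/lib/
SparkR  sparkr.zip
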
  1. Create a workflow definition (workflow.xml). The following simple workflow definition executes one Spark job:
    <workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkR'>
        <global>
            <configuration>
                <property>
                    <name>oozie.launcher.yarn.app.mapreduce.am.env</name>
                    <value>SPARK_HOME=/usr/iop/4.2.5.0-0000/spark2</value>
                </property>
            </configuration>
        </global>
        <start to="sparkAction"/>
        <action name="sparkAction">
            <spark xmlns="uri:oozie:spark-action:0.1">
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <master>${master}</master>
                <name>SparkR</name>
                <jar>${nameNode}/user/${wf:user()}/${examplesRoot}/dataframe.R</jar>
                <spark-opts>--conf spark.driver.extraJavaOptions=-Diop.version=4.2.5.0 --conf spark.yarn.archive=hdfs://nn:8020/iop/apps/4.2.5.0-0000/spark2/spark2-iop-yarn-archive.tar.gz</spark-opts>
            </spark>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Workflow failed, error
                message[${wf:errorMessage(wf:lastErrorNode())}]
            </message>
        </kill>
        <end name="end"/>
    </workflow-app>
    
  2. Create an Oozie job configuration (job.properties).
    nameNode=hdfs://nn:8020
    jobTracker=rm:8050
    master=yarn-cluster
    queueName=default
    examplesRoot=spark-example
    oozie.use.system.libpath=true
    oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}
    
  3. Create an Oozie application directory. Create an application directory structure with the workflow definition and resources as shown in the following example:
    +-~/spark-example/
      +-job.properties
      +-workflow.xml
      +-dataframe.R
    
  4. Copy the application to the HDFS. Copy the spark-example/ directory to the user HOME directory in the HDFS. Ensure that the spark-example location in the HDFS matches the value of oozie.wf.application.path in job.properties.
    $ hadoop fs -put spark-example /user/ambari-qa/
    
  5. Run the example job.
    1. Submit the Oozie job by running the following command:
      $ cd ~/spark-example
      
      $ oozie job -oozie http://oozie-host:11000/oozie -config ./job.properties -run
      job: 0000032-161115185001062-oozie-oozi-W
      
    2. Check the workflow job status:
      $ oozie job -oozie http://oozie-host:11000/oozie -info 0000032-161115185001062-oozie-oozi-W
      
      Job ID : 0000032-161115185001062-oozie-oozi-W
      ------------------------------------------------------------------------------------------------------------------------------------
      Workflow Name : SparkR
      App Path      : hdfs://oozie-host:8020/user/ambari-qa/spark-example
      Status        : SUCCEEDED
      Run           : 0
      User          : ambari-qa
      Group         : -
      Created       : 2016-11-16 08:21 GMT
      Started       : 2016-11-16 08:21 GMT
      Last Modified : 2016-11-16 08:22 GMT
      Ended         : 2016-11-16 08:22 GMT
      CoordAction ID: -
      
      Actions
      ------------------------------------------------------------------------------------------------------------------------------------
      ID                                                                            Status    Ext ID                 Ext Status Err Code  
      ------------------------------------------------------------------------------------------------------------------------------------
      0000032-161115185001062-oozie-oozi-W@:start:                                  OK        -                      OK         -         
      ------------------------------------------------------------------------------------------------------------------------------------
      0000032-161115185001062-oozie-oozi-W@sparkAction                              OK        job_1479264601071_0068 SUCCEEDED  -         
      ------------------------------------------------------------------------------------------------------------------------------------
      0000032-161115185001062-oozie-oozi-W@end                                      OK        -                      OK         -         
      ------------------------------------------------------------------------------------------------------------------------------------	
      

The full SparkR program

library(SparkR)

# Initialize SparkContext and SQLContext
sc <- sparkR.init(appName="SparkR-DataFrame-example")
sqlContext <- sparkRSQL.init(sc)

# Create a simple local data.frame
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))

# Convert local data frame to a SparkR DataFrame
df <- createDataFrame(sqlContext, localDF)

# Print its schema
printSchema(df)
# root
#  |-- name: string (nullable = true)
#  |-- age: double (nullable = true)

# Create a DataFrame from a JSON file
# path <- file.path(Sys.getenv("SPARK_HOME"), "examples/src/main/resources/people.json")
path <- file.path("file:///usr/iop/4.2.5.0-0000/spark2/examples/src/main/resources/people.json")
peopleDF <- read.json(sqlContext, path)
printSchema(peopleDF)

# Register this DataFrame as a table.
registerTempTable(peopleDF, "people")

# SQL statements can be run by using the sql method provided by sqlContext
teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")

# Call collect to get a local data.frame
teenagersLocalDF <- collect(teenagers)

# Print the teenagers in our dataset 
print(teenagersLocalDF)

# Stop the SparkContext now
sparkR.stop()
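
As with the PySpark example, the script can be tested outside of Oozie by submitting it directly with spark-submit; this invocation is only a sketch:

$ spark-submit --master yarn --deploy-mode cluster dataframe.R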

9 comments on "Scheduling a Spark job written in PySpark or SparkR on YARN with Oozie"

  1. For SparkR, I’m getting this exception in the logs:
    Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, Application application_ finished with failed status
    org.apache.spark.SparkException: Application application_ finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1143)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1194)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:745)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    at org.apache.oozie.action.hadoop.SparkMain.runSpark(SparkMain.java:311)
    at org.apache.oozie.action.hadoop.SparkMain.run(SparkMain.java:232)
    at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:58)
    at org.apache.oozie.action.hadoop.SparkMain.main(SparkMain.java:62)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:239)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)

    • jiaodongying July 20, 2017

      Could you please also add the error message and exception from the YARN log for this Oozie SparkR job? There is no detailed information in the Oozie log.
      Thanks

  2. Fernando Belo July 20, 2017

    Thanks for the very detailed tutorial.

    Unfortunately, I’m getting the following error when trying to run Python spark script via Oozie, using very similar configuration files as described. Any idea?

    Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, org/apache/spark/deploy/SparkSubmit
    java.lang.NoClassDefFoundError: org/apache/spark/deploy/SparkSubmit
    at org.apache.oozie.action.hadoop.SparkMain.run(SparkMain.java:222)
    at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:58)
    at org.apache.oozie.action.hadoop.SparkMain.main(SparkMain.java:62)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:237)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
    Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.SparkSubmit
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    … 16 more

    • DONGYINGJIAO July 20, 2017

      Did you have spark share lib installed? It is under /user/oozie/share/lib/lib_XXXX/spark on hdfs.
      And did you set oozie.use.system.libpath=true in job.properties?

      • Fernando Belo July 21, 2017

        Yes, just checked and I have sharelib on HDFS.

        job.properties:

        nameNode=hdfs://hadoop1.hadoopcwb:8020
        jobTracker=hadoop1.hadoopcwb:8032
        master=yarn-cluster
        queueName=default
        oozie.use.system.libpath=true
        oozie.wf.application.path=${nameNode}/analysis/000001/oozie

        workflow.xml

        <property>
            <name>oozie.launcher.yarn.app.mapreduce.am.env</name>
            <value>PYSPARK_ARCHIVES_PATH=pyspark.zip</value>
        </property>

        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>

        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>${master}</master>
        <name>MyApp</name>
        <jar>${nameNode}/analysis/000001/oozie/dummy.py</jar>
        <spark-opts>--conf spark.driver.extraJavaOptions=-Dhdp.version=2.6.1.0-129 --conf spark.yarn.archive=hdfs://hadoop1.hadoopcwb:8020/hdp/apps/2.6.1.0-129/spark2/spark2-hdp-yarn-archive.tar.gz --py-files pyspark.zip,py4j-0.10.4-src.zip</spark-opts>

        • DONGYINGJIAO July 23, 2017

          Hi:
          This Spark Python job can run successfully in our cluster; the Oozie version is 4.3.0 and the Spark version is 2.1.0. We built Oozie with Spark 2.1.0, so the share lib for Spark is also 2.1.0.
          What are your Oozie version, Spark version, and Spark share lib version? The Spark configuration might need a slight change if your versions differ from ours. Also, are there two Spark versions in your HDP installation? If so, that might lead to a conflict.
          Thanks

  3. Melanie Flink October 12, 2017

    Thanks for the very detailed tutorial! Unfortunately I got the same error as Fernando Belo running pyspark and having very similar configurations (except using Spark 1.6… I adapted all configurations to Spark 1.6).
    Shared libs are installed. There aren’t two spark versions on HDP.
    Oozie version = 4.2.0.2
    Spark = 1.6.2
    --conf spark.driver.extraJavaOptions=-Dhdp.version=2.5.0.0
    --spark.yarn.archive=hdfs://bdp-dev/hdp/apps/2.5.0.0-1245/spark/spark-hdp-assembly.jar
    --py-files pyspark.zip,py4j-0.9-src.zip

    Do you know if it is working for the lower version, too? And do we need a main() function within the python-file?

    Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, Application application_1507297731037_7513 finished with failed status
    org.apache.spark.SparkException: Application application_1507297731037_7513 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1122)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1169)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    at org.apache.oozie.action.hadoop.SparkMain.runSpark(SparkMain.java:289)
    at org.apache.oozie.action.hadoop.SparkMain.run(SparkMain.java:211)
    at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:51)
    at org.apache.oozie.action.hadoop.SparkMain.main(SparkMain.java:59)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:242)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

  4. Hi ,

    I am scheduling a Spark job written in PySpark on YARN with Oozie. In mainScript.py I am trying to call subScript.py, but it’s throwing this error: “can’t open file ‘/user/PySpark/main/subScript.py’: [Errno 2] No such file or directory.”

    Path for mainScript.py : /user/PySpark/main/mainScript.py
    Path for subScript.py : /user/PySpark/main/subScript.py

    I am able to execute this through spark-submit, but I need to execute it through Oozie.

    Any suggestions would be highly appreciated…. 🙂
