
As everyone in the world of computing knows, Apache Spark is one of the most interesting and talked-about projects in today’s open source community. Yet for all that attention, Spark is still a long way from being a “user friendly” application, especially when it comes to building applications.

When building your own standalone Java application you would typically use something like Apache Ant, or the built-in tools of your IDE, to generate the required JAR file or whatever other artifact is needed. Both approaches are covered in detail in plenty of books and online documentation. Apache Spark, on the other hand, typically uses a tool called sbt, short for “Simple Build Tool”. This build tool is used mainly within the Scala ecosystem, and in some cases it can become quite the opposite of “simple”. More can be found about the tool here: http://www.scala-sbt.org/.
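
To give you an idea of what sbt expects, here is a minimal sketch of a build definition for a Spark project. The project name, Scala version and Spark version are assumptions for illustration only; match them to your own installation.

    // build.sbt -- a minimal sketch of an sbt build definition for a Spark project.
    // The Scala and Spark versions below are assumptions; adjust them to your environment.
    name := "Simple Project"

    version := "1.0"

    scalaVersion := "2.10.4"

    // Pull in the Spark core library; %% appends the Scala version to the artifact name.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1"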

When I started working with Apache Spark I ran into some major issues with sbt and its integration with Spark, so to help others avoid them I decided to post this hands-on tutorial.

With sbt and Spark, you can build either a standalone Spark application or an application that runs on a commercial, YARN-enabled Hadoop distribution. Most of us will probably not configure Hadoop from scratch and will instead use some kind of commercial distribution; I have used the IBM BigInsights 4.0 Quick Start Edition (now called IBM IOP for Hadoop) for this purpose.

So, the tutorial contains two parts:
Part 1: building your first Spark standalone application
Part 2: building your first Spark application on IBM IOP for Hadoop (BigInsights), which is YARN-enabled

Part 1 – Building your first Spark standalone application

  • Step 1: install sbt on the target machine
  • Step 2: code the simple program
  • Step 3: copy the file to the sbt-enabled system
  • Step 4: create the input text file at /home/Spark/input.txt
  • Step 5: create and edit the simple.sbt file
  • Step 6: create the mkDirStructure.sh to automate the directory creation
  • Step 7: run the mkDirStructure.sh
  • Step 8: package the Spark application
  • Step 9: run the Spark application

Download the first part here; a minimal sketch of the key files and commands follows below.
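
To make the steps more concrete, here is a sketch of the “simple program” from Step 2. It is essentially the line-count example from the official Spark quick start, reading the input file created in Step 4; the code here is illustrative, and the downloadable material above remains the authoritative version.

    // SimpleApp.scala -- a minimal sketch of the simple program (Step 2).
    // It counts lines in the input file created in Step 4 at /home/Spark/input.txt.
    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext

    object SimpleApp {
      def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("Simple Application")
        val sc = new SparkContext(conf)

        // Load the input file and count the lines containing "a" and "b"
        val logData = sc.textFile("/home/Spark/input.txt", 2).cache()
        val numAs = logData.filter(line => line.contains("a")).count()
        val numBs = logData.filter(line => line.contains("b")).count()
        println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))

        sc.stop()
      }
    }

With a simple.sbt along the lines of the build definition sketched earlier (Step 5), Steps 8 and 9 then boil down to running sbt package and submitting the resulting jar with something like $SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[2] target/scala-2.10/simple-project_2.10-1.0.jar — the exact jar name depends on the name and version declared in your .sbt file.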

Part 2 – Building your first Spark application on IBM IOP for Hadoop:

  • Step 1: install sbt on the target machine (Ubuntu Linux)
  • Step 2: code the simple program (yarn-client compatible)
  • Step 3: copy the file to the sbt-enabled system
  • Step 4: create and edit the simpleCluster.sbt file
  • Step 5: create the mkDirStructure.sh to automate the directory creation
  • Step 6: run the mkDirStructure.sh
  • Step 7: package the Spark application
  • Step 8: create the input on BigInsights system
  • Step 9: move the jar to the BigInsights driver machine
  • Step 10: run the Spark application on the BigInsights machine

Download the second part here.
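
For the YARN-enabled variant, the main differences are in the build file and in how the job is submitted. Below is a minimal sketch of what the simpleCluster.sbt from Step 4 might look like; the versions are assumptions, and marking spark-core as “provided” is a common choice because the cluster already ships the Spark jars, but again the downloadable material above is the authoritative version.

    // simpleCluster.sbt -- a minimal sketch of the build file for the YARN variant (Step 4).
    // Versions are assumptions; "provided" keeps the Spark jars out of your package
    // because the cluster supplies them at runtime.
    name := "Simple Cluster Project"

    version := "1.0"

    scalaVersion := "2.10.4"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1" % "provided"

After sbt package (Step 7) and copying the jar to the BigInsights driver machine (Step 9), the submission on the cluster looks something like spark-submit --class "SimpleApp" --master yarn-client simple-cluster-project_2.10-1.0.jar, with the program reading its input from the location created on the BigInsights system in Step 8.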
