Deprecated. Please reference the updated post for BigInsights v4.x linked below.

Introduction

Apache Zeppelin is a web-based notebook that enables interactive data ingestion, data discovery, data analytics, visualization and collaboration.

Supported languages currently include Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown, and Shell.

Objective

This technical document shows how to install and set up Zeppelin with Spark on a BigInsights cluster.

Versions Tested

  • BigInsights v3.0.0.1, v3.0.0.2
  • Apache Zeppelin v0.5.0
  • Apache Spark v1.3.x, v1.4.x
  • Red Hat v6.x, CentOS v6.x

For BigInsights v4.x, please reference this post: https://developer.ibm.com/hadoop/blog/2015/09/30/setup-spark-notebook-biginsights/

Step 1: Install and Create Spark Cluster

  1. Download Apache Maven if it is not already installed (http://maven.apache.org/download.cgi).
  2. Install Apache Maven:
    • sudo tar zxvf apache-maven-3.3.3-bin.tar.gz -C /usr/local
    • sudo ln -s /usr/local/apache-maven-3.3.3 /usr/local/maven
    • export MAVEN_HOME=/usr/local/maven
    • export PATH=$PATH:$MAVEN_HOME/bin
  3. Download the Spark "Source" code, either v1.3.1 or v1.3.0 (http://spark.apache.org/downloads.html).
  4. Compile and Install Spark
    1. sudo useradd spark; sudo passwd spark
    2. sudo tar zxvf spark-1.3.1.tgz -C /usr/local
    3. sudo ln -s /usr/local/spark-1.3.1 /usr/local/spark
    4. cd /usr/local/spark; sudo mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
    5. sudo chown -R spark:spark /usr/local/spark
    6. export SPARK_HOME=/usr/local/spark;export PATH=$PATH:$SPARK_HOME/bin
    7. Repeat steps 1-6 above on all Spark nodes.
  5. Create Spark Cluster
    • su - spark
    • cp $SPARK_HOME/conf/slaves.template $SPARK_HOME/conf/slaves
    • Add all Spark nodes, one per line.
      Example: cat slaves
      n1.ibm.com
      n2.ibm.com
      n3.ibm.com
    • cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
      Add the following 3 lines to $SPARK_HOME/conf/spark-env.sh:
      HADOOP_CONF_DIR=$BIGINSIGHTS_HOME/hadoop-conf
      SPARK_MASTER_PORT=7077
      SPARK_MASTER_WEBUI_PORT=8089
    • cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
      Add the following 2 lines to $SPARK_HOME/conf/spark-defaults.conf:
      spark.master spark://bigdata.ibm.com:7077
      spark.driver.extraJavaOptions -Dhadoop.version=2.2.0
    • Add the following environment variables to "~spark/.bashrc". The assembly jar name must match your Spark/Hadoop build, and SPARK_JAR points to a copy of the jar in HDFS (see the upload note at the end of this step):
      export SPARK_CLASSPATH=$SPARK_HOME/core/target/jars/guava-14.0.1.jar:$NOTEBOOK_HOME/interpreter/tajo/protobuf-java-2.5.0.jar:$SPARK_HOME/assembly/target/scala-2.10/spark-assembly-1.3.1-hadoop2.2.0.jar:$BIGINSIGHTS_HOME/IHC/lib/ibm-compression.jar:$BIGINSIGHTS_HOME/IHC/hadoop-core.jar:$BIGINSIGHTS_HOME/bigindex/lib/hadoop-mapreduce-client-core-2.2.0.jar
      export SPARK_JAR=hdfs://hostname:9000/user/spark/share/lib/spark-assembly.jar
    • Start Spark Cluster
      • su - spark
      • $SPARK_HOME/sbin/start-all.sh
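    • Because SPARK_JAR above references an assembly jar stored in HDFS, upload the jar before submitting jobs. A minimal sketch, assuming the paths and jar name from the exports above:
      hadoop fs -mkdir -p /user/spark/share/lib
      hadoop fs -put $SPARK_HOME/assembly/target/scala-2.10/spark-assembly-1.3.1-hadoop2.2.0.jar /user/spark/share/lib/spark-assembly.jar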

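To verify the cluster came up, note that start-all.sh reaches every host in conf/slaves over SSH, so passwordless SSH for the spark user is assumed. Check the master web UI at http://bigdata.ibm.com:8089 (the SPARK_MASTER_WEBUI_PORT set above), or attach a shell to the master, for example:

  spark-shell --master spark://bigdata.ibm.com:7077
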
Step 2: Install and Configure Zeppelin

  1. sudo yum install npm git nodejs
  2. Download the Zeppelin "Source" code (http://www.apache.org/dyn/closer.cgi/incubator/zeppelin/0.5.0-incubating).
  3. Unpack Zeppelin and point a spark-notebook link at it:
    • sudo unzip incubator-zeppelin-master.zip -d /usr/local
    • sudo ln -s /usr/local/incubator-zeppelin-master /usr/local/spark-notebook
    • sudo chown -R spark:spark /usr/local/spark-notebook
    • export SPARK_NOTEBOOK=/usr/local/spark-notebook
  4. su - spark
  5. git config --global url."https://".insteadOf git:// (the build will fail without this step! Note the two dashes before "global".)
  6. cd $SPARK_NOTEBOOK; mvn install -DskipTests -Dspark.version=1.3.1 -Dhadoop.version=2.2.0
  7. cp $SPARK_NOTEBOOK/conf/zeppelin-env.sh.template $SPARK_NOTEBOOK/conf/zeppelin-env.sh
  8. cp $SPARK_NOTEBOOK/conf/zeppelin-site.xml.template $SPARK_NOTEBOOK/conf/zeppelin-site.xml
  9. cp $SPARK_NOTEBOOK/conf/log4j.properties.template $SPARK_NOTEBOOK/conf/log4j.properties
  10. Add the following 5 lines to $SPARK_NOTEBOOK/conf/zeppelin-env.sh
    export HADOOP_CONF_DIR=$BIGINSIGHTS_HOME/hadoop-conf
    export ZEPPELIN_JAVA_OPTS="-Dhadoop.version=2.2.0"
    export ZEPPELIN_PORT=8888
    export ZEPPELIN_NOTEBOOK_DIR=notebook
    export ZEPPELIN_INTERPRETERS=org.apache.zeppelin.spark.SparkInterpreter,org.apache.zeppelin.spark.PySparkInterpreter,org.apache.zeppelin.spark.SparkSqlInterpreter,org.apache.zeppelin.spark.DepInterpreter,org.apache.zeppelin.markdown.Markdown,org.apache.zeppelin.shell.ShellInterpreter,org.apache.zeppelin.hive.HiveInterpreter
  11. Modify file $SPARK_NOTEBOOK/conf/zeppelin-site.xml
    Match the port number to the one listed in "zeppelin-env.sh" (8888 above); see the example snippet after this list.
  12. Start the Notebook process: $SPARK_NOTEBOOK/bin/zeppelin-daemon.sh start
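The zeppelin-site.xml edit in step 11 might look like the snippet below; zeppelin.server.port is the property name in Zeppelin's default template, and 8888 matches ZEPPELIN_PORT above:

  <property>
    <name>zeppelin.server.port</name>
    <value>8888</value>
    <description>Server port. Keep in sync with ZEPPELIN_PORT in zeppelin-env.sh.</description>
  </property>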

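Once the daemon is up, a quick sanity check (substitute your Zeppelin node's hostname):

  $SPARK_NOTEBOOK/bin/zeppelin-daemon.sh status

Then browse to http://<zeppelin-host>:8888 to open the notebook UI.
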
Step 3: Testing Zeppelin by reading an HDFS file

  • File "/tmp/emp2.txt" resides in HDFS, NOT on the local POSIX file system; see the example paragraph below.
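  • A minimal notebook paragraph to read the file, using the SparkContext (sc) that Zeppelin provides; the output depends on whatever emp2.txt contains:

    // The default interpreter is Spark (Scala); sc is pre-defined by Zeppelin
    val emp = sc.textFile("/tmp/emp2.txt") // bare path resolves to HDFS via HADOOP_CONF_DIR
    emp.count()                            // number of lines in the file
    emp.take(5).foreach(println)           // show the first 5 lines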
