Animesh Singh, Anthony Amanse, Andy Shi | Updated July 16, 2018 - Published October 1, 2017
Apache Spark is an open source cluster-computing framework. You can run Spark using its standalone cluster mode, on Hadoop YARN, on container orchestrators such as Apache Mesos, or in the cloud on an IaaS. It can access diverse data sources, including HDFS, Apache Cassandra, Apache HBase, and Amazon S3, and you can use it interactively from the Scala, Python, and R shells.
z/OS is an extremely scalable and secure
high-performance operating system based on the 64-bit z/Architecture. z/OS is
highly reliable for running mission-critical applications, and the operating
system supports web- and Java-based applications.
In this tutorial, we demonstrate running an analytics application using Spark on
z/OS. Apache Spark on z/OS provides an in-place, optimized abstraction and
real-time analysis of structured and unstructured enterprise data. The
environment in this tutorial is powered by the z Systems Community Cloud.
The IBM z/OS Platform for Apache Spark includes a supported version of Apache
Spark open source capabilities, consisting of the Apache Spark core, Spark SQL,
Spark Streaming, the Machine Learning Library (MLlib), and GraphX. It also
includes optimized data access to a broad set of structured and unstructured
data sources through Spark APIs. With this capability, traditional z/OS data
sources, such as IMS, VSAM, IBM Db2 for z/OS, PDSE, or SMF data, can be
accessed in a performance-optimized manner with Spark.
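As a concrete illustration, here is a minimal sketch of reading a Db2 for z/OS table into a Spark dataframe through Spark's standard JDBC data source. This is an assumption-laden example, not the tutorial's exact code: the host, port, location, credentials, and table name are placeholders, and a Spark 2.x SparkSession named spark is assumed. (VSAM and other non-relational sources are reached through the platform's optimized data service rather than plain JDBC.)

    // Hypothetical sketch: read a Db2 for z/OS table into a Spark dataframe.
    // Replace the placeholder URL, credentials, and table name with your own.
    val db2df = spark.read
      .format("jdbc")
      .option("driver", "com.ibm.db2.jcc.DB2Driver") // IBM Db2 JDBC driver class
      .option("url", "jdbc:db2://<host>:<port>/<location>")
      .option("user", "<username>")
      .option("password", "<password>")
      .option("dbtable", "SPARKDB.CLIENT_TRANS")     // hypothetical table name
      .load()
    db2df.printSchema()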
This analytics example uses data stored in Db2 and VSAM tables, and a machine
learning application written in Scala. It also uses the open source Jupyter
Notebook to write and submit Scala code to your Spark instance and to view the
output within a web GUI. Jupyter Notebook is commonly used in data analytics
for data cleaning and transformation, numerical simulation, statistical
modeling, machine learning, and much more.
The scenarios are accomplished by using:
IBM z/OS Platform for Apache Spark
IBM Db2 for z/OS
Register at z Systems Community Cloud
for a trial account. You’ll receive an email containing credentials to access
the self-service portal. Use these credentials to start exploring all the
available services.
This how-to should take approximately one hour.
Open a web browser and enter the URL to access the z Systems Community
Cloud self-service portal.
Enter your Portal User ID and Portal Password, and click Sign In.
You see the home page for the z Systems Community Cloud self-service portal.
Click on Try Analytics Service.
You now see a dashboard showing the status of your Apache Spark on z/OS instance.
At the top of the screen, notice the z/OS Status indicator, which should show
the status of your instance as OK.
In the middle of the screen, the Spark Instance, Status, Data management,
and Operations sections are displayed. The Spark Instance section contains
your individual Spark username and IP address.
Below the field headings, you can see buttons for functions that can be applied
to your instance.
The following table lists the operation for each function:
If this is the first time you are trying the Analytics Service on z/OS, you
must set a new Spark password.
Confirm your instance is Active. If it is Stopped, click Start to start it.
Download all the sample files here.
Load the Db2 data file by clicking Upload Data. Select and load the Db2 DDL
file and the Db2 data file. Click Upload.
“Upload Success” appears in the dashboard when the data load is complete. The
VSAM data for this exercise has already been loaded for you. However, this step
may be repeated by loading the VSAM copybook and VSAM data file that you
downloaded from your local system.
To submit a prepared Scala program to analyze the data, provide the following arguments:
Spark Instance Username
Spark Instance Password
You will use these same credentials to log in to the GUI and view the job results.
“JOB Submitted” appears in the dashboard when the program is complete. This
Scala program accesses Db2 and VSAM data, performs transformations on the data,
joins these two tables in a Spark dataframe, and stores the result back to Db2.
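For reference, the write-back step of such a program might look like the following sketch, which uses Spark's DataFrameWriter JDBC API. The URL, credentials, and target table name are placeholders, and client_join stands for the joined dataframe (the same name the notebook uses later):

    import java.util.Properties

    // Hypothetical sketch: store the joined dataframe back to Db2.
    val props = new Properties()
    props.setProperty("user", "<username>")
    props.setProperty("password", "<password>")
    props.setProperty("driver", "com.ibm.db2.jcc.DB2Driver")

    client_join.write
      .mode("overwrite") // replace the target table if it already exists
      .jdbc("jdbc:db2://<host>:<port>/<location>", "SPARKDB.CLIENT_JOIN", props)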
Launch your individual Spark worker output GUI to view the job you just submitted.
Launch the Jupyter Notebook tool that is installed in the dashboard. This tool
allows you to write and submit Scala code to your Spark instance and view the
output within a web GUI. To launch it, click on Jupyter in your dashboard; the
Jupyter home page opens in your browser.
The prepared Scala program in this section accesses Db2 and VSAM data, performs
transformations on the data, joins these two tables in a Spark dataframe, and
stores the result back to Db2. It also performs a logistic regression analysis
and plots the output.
Double-click the Demo.ipynb file.
The Jupyter Notebook connects to your Spark on z/OS instance automatically and
is in the ready state when the Apache Toree – Scala indicator in the top right
hand corner of the screen is clear.
The Jupyter Notebook environment is divided into input cells labeled In [ ]:.
Run cell #1 – The Scala code in the first cell loads the VSAM data (customer
information) into Spark and performs a data transformation.
Click on the first In [ ]: to select the first cell.
The left border changes to blue when a cell is in command mode, as shown below.
Before running the code, change the value of zOS_IP to your Spark IP address,
the value of zOS_USERNAME to your Spark username, and the value of
zOS_PASSWORD to your Spark password.
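The connection settings at the top of the cell look something like this sketch (the values shown are placeholders):

    // Replace these placeholders with your own instance details from the dashboard.
    val zOS_IP       = "xxx.xxx.xxx.xxx" // your Spark IP address
    val zOS_USERNAME = "SPARKUSER"       // your Spark username
    val zOS_PASSWORD = "password"        // your Spark password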
Click the run cell button, indicated by the red box as shown below.
The Jupyter Notebook connection to your Spark instance is in the busy state
when the Apache Toree – Scala indicator in the top right hand corner of the
screen is grey.
When this indicator turns clear, the cell run has completed and returned to the
ready state. The output should be similar to the following:
Run cell #2 – The Scala code in the second cell loads the Db2 data (transaction
data) into Spark and performs a data transformation.
Click on the next In [ ]: to select the next cell, and click the run cell button.
The output should be similar to the following:
Run cell #3 – The Scala code in the third cell joins the VSAM and Db2 data into
a new client_join dataframe in Spark.
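A minimal sketch of such a join, assuming the two input dataframes are named clientInfo_df and clientTrans_df and share a customer key column (CONT_ID here is a hypothetical name):

    // Join customer information (VSAM) with transaction data (Db2)
    // on a shared customer key.
    val client_join = clientInfo_df.join(clientTrans_df, "CONT_ID")
    client_join.show(5)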
Run cell #4 – The Scala code in the fourth cell performs a logistic regression
to evaluate the probability of customer churn as a function of customer
activity level. The result_df dataframe is also created, which is used to
plot the results on a line graph.
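As a rough sketch of what such an analysis can look like with Spark MLlib (the column names CHURN and ACTIVITY_LEVEL are assumptions, not the tutorial's actual schema):

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.VectorAssembler

    // Assemble the activity-level feature into the vector column MLlib expects.
    val assembler = new VectorAssembler()
      .setInputCols(Array("ACTIVITY_LEVEL"))
      .setOutputCol("features")
    val training = assembler.transform(client_join)
      .withColumnRenamed("CHURN", "label") // hypothetical 0/1 churn label

    // Fit a logistic regression of churn as a function of activity level.
    val model = new LogisticRegression().setMaxIter(10).fit(training)

    // Score the data; the probability column drives the line-graph plot.
    val result_df = model.transform(training)
      .select("ACTIVITY_LEVEL", "probability")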
Run cell #5 – The Scala code in the fifth cell plots the ‘plot_df’ dataframe.
To get the number of rows in the input VSAM dataset, use:
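A one-liner, assuming the VSAM data landed in a dataframe named clientInfo_df (the variable name in your notebook may differ):

    clientInfo_df.count()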
The result should be 6001.
To get the number of rows in the input Db2 dataset, use:
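Similarly, assuming the Db2 data landed in a dataframe named clientTrans_df:

    clientTrans_df.count()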
The result should be 20000.
To get the number of rows in the joined dataset, use:
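Using the client_join dataframe created in cell #3:

    client_join.count()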
The result should be 112.
Congratulations on completing this how-to on running a Jupyter notebook that uses Apache Spark on z/OS! Recall that the z/OS Platform for Apache Spark includes a supported version of Apache Spark open source capabilities, consisting of the Apache Spark core, Spark SQL, Spark Streaming, the Machine Learning Library (MLlib), and GraphX. Be sure to use these tools in conjunction with your new skills to analyze more data with Apache Spark on z/OS.