Spin up an Apache Spark cluster and then do high-level math on it? Yes! Believe it or not, with these steps it’s a lot easier than you might think. This tutorial walks you through a very simple way of using Apache SystemML for your machine learning and big data needs, getting you set up and running SystemML in a Spark shell on the IBM Analytics Engine (IAE) like a star.

Learning objectives

At a high level, SystemML handles the machine learning and mathematical parts of your data science project. You can log in to a Spark shell, load SystemML, load your data, and write your linear algebra, statistical equations, matrices, and so on in far less code than the equivalent Spark shell syntax would take. SystemML helps not only with mathematical exploration and machine learning algorithms, but also lets you work on Spark, where you can do all of the above with data far too big for your local computer.

Prerequisites

With the IBM Analytics Engine, you can spin up a Spark cluster in just a few minutes using the web user interface.

With these tools, I’ll walk you through how to satisfy SystemML’s prerequisites, set up IAE and your Spark cluster, connect to the cluster from your computer over SSH, start a Spark shell and import SystemML, and finally load some data and work through a few examples in Scala. Whew, that’s a lot, but I promise to go through it all!

Now let’s get going and start learning!

Estimated time

Completing this tutorial should take about 30 minutes.

Steps

Create an IBM Analytics Engine service

  1. Log in to IBM Cloud and find the IBM Analytics Engine service, or navigate directly to https://cloud.ibm.com/catalog/services/ibm-analytics-engine

  2. Create an instance of the service (you can accept all the defaults for the purposes of this tutorial) and click “Create” at the bottom of the page. Provisioning may take a few minutes.

    (Screenshot: the IBM Analytics Engine “Create” page)

  3. Once your cluster has been created, make sure you are in the “Manage” section; if you are not, navigate to it! You’ll notice there is a lot of information in this section. Be sure to note your username, your password, and the SSH command (under “Connection Details”) for use in later steps.

    (Screenshot: the IBM Analytics Engine “Manage” page)

  4. Start by copying your SSH command.

  5. Go to your terminal, paste the command, and press Enter. (An example of what that command looks like follows this list.)

  6. You’ll be prompted for a password. Use the password shown on the “Manage” console.
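
For reference, the SSH command you copy will look something like the following. The username and hostname here are just placeholders, so use the exact line from your own “Manage” page:

ssh clsadmin@your-cluster-hostname.us-south.ae.appdomain.cloud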

Note: The IBM Analytics Engine page has great material if you want to learn more about the service.

Logged in? You’re a rockstar! Now we can start SystemML!

Access SystemML from the Spark shell

First, download SystemML (from your terminal):

wget https://sparktc.ibmcloud.com/repo/latest/SystemML.jar

Now start the Spark shell with SystemML on the classpath. (The 4G driver and executor memory settings here are reasonable defaults for this tutorial; feel free to adjust them for your cluster.)

spark-shell --executor-memory 4G --driver-memory 4G --jars SystemML.jar
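
Once the shell comes up, a quick sanity check never hurts. For example, you can print the Spark version (the exact value you see depends on your cluster):

sc.version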

Now, in the Spark shell (Scala), import the MLContext classes and create an MLContext to start SystemML.

    import org.apache.sysml.api.mlcontext._
    import org.apache.sysml.api.mlcontext.ScriptFactory._
    val ml = new MLContext(sc)
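
If you want to double-check that SystemML itself loaded, the MLContext can report its version (the string you see depends on the jar you downloaded):

    ml.version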

(Screenshot: Apache SystemML running on Spark in IAE)

Congratulations!! You are now in Apache SystemML! Look at you!

In the future, you’ll only need these last two steps (starting the shell with the jar and creating the MLContext) to get going, and you can repeat them in a local Spark shell as well.

Run a workload with SystemML on Spark shell

Let’s now figure out how to load a script and run it, then load some data and run a few examples, so you can get familiar with the Spark shell and SystemML.

For this example, we’ll be using a script from a URL. Here, s1 is created by reading Univar-Stats.dml from its URL.

val uniUrl = "https://raw.githubusercontent.com/apache/incubator-systemml/master/scripts/algorithms/Univar-Stats.dml"
val s1 = ScriptFactory.dmlFromUrl(uniUrl)
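
As a side note, if your script lived somewhere else, ScriptFactory can also build scripts from a local file or a plain string with dmlFromFile and dmlFromString. For example (the file path here is hypothetical):

val sFromFile = ScriptFactory.dmlFromFile("/home/clsadmin/Univar-Stats.dml")
val sFromString = ScriptFactory.dmlFromString("print('hello from DML')")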

Our next step is to parallelize some data, read it in as two matrices (RDDs), compute the sum of each, and output a message saying which sum is greater.

val data1 = sc.parallelize(Array("1.0,2.0", "3.0,4.0"))
val data2 = sc.parallelize(Array("5.0,6.0", "7.0,8.0"))
val s = """
s1 = sum(m1);
s2 = sum(m2);
if (s1 > s2) {
  message = "s1 is greater"
} else if (s2 > s1) {
  message = "s2 is greater"
} else {
  message = "s1 and s2 are equal"
}
"""

val script = dml(s).in("m1", data1).in("m2", data2).out("s1", "s2", "message")

You should get:

script: org.apache.sysml.api.mlcontext.Script =
Inputs:
[1] (RDD) m1: ParallelCollectionRDD[0] at parallelize at <console>:33
[2] (RDD) m2: ParallelCollectionRDD[1] at parallelize at <console>:33

Outputs:
[1] s1
[2] s2
[3] message

Now print your script info.

println(script.info)

You should see:

Script Type: DML

Inputs:
[1] (RDD) m1: ParallelCollectionRDD[0] at parallelize at <console>:33
[2] (RDD) m2: ParallelCollectionRDD[1] at parallelize at <console>:33

Outputs:
[1] s1
[2] s2
[3] message

Input Parameters:
None

Input Variables:
[1] m1
[2] m2

Output Variables:
[1] s1
[2] s2
[3] message

Symbol Table:
[1] (Matrix) m1: Matrix: null, [-1 x -1, nnz=-1, blocks (1 x 1)], csv, not-dirty
[2] (Matrix) m2: Matrix: null, [-1 x -1, nnz=-1, blocks (1 x 1)], csv, not-dirty

Script String:

s1 = sum(m1);
s2 = sum(m2);
if (s1 > s2) {
  message = "s1 is greater"
} else if (s2 > s1) {
  message = "s2 is greater"
} else {
  message = "s1 and s2 are equal"
}

Script Execution String:
m1 = read('');
m2 = read('');

s1 = sum(m1);
s2 = sum(m2);
if (s1 > s2) {
  message = "s1 is greater"
} else if (s2 > s1) {
  message = "s2 is greater"
} else {
  message = "s1 and s2 are equal"
}
write(s1, '');
write(s2, '');
write(message, '');

Now execute your script and get your results!

val results = ml.execute(script)

You should get:

results: org.apache.sysml.api.mlcontext.MLResults =
[1] (Double) s1: 10.0
[2] (Double) s2: 26.0
[3] (String) message: s2 is greater

Just as an example, you can assign a result to a variable named x and get it back in Double form.

val x = results.getDouble("s1")

You should get:

x: Double = 10.0

Do the same to set y:

val y = results.getDouble("s2")

You should get:

y: Double = 26.0

Simple example of adding both:

x + y

You should get:

res1: Double = 36.0

Here’s another version of the same example. Because the API is very Scala friendly, you can pull your results out as a Scala tuple.

val (firstSum, secondSum, sumMessage) = results.getTuple[Double, Double, String]("s1", "s2", "message")

You should get:

firstSum: Double = 10.0
secondSum: Double = 26.0
sumMessage: String = s2 is greater
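
Since message is a String rather than a Double, you can also pull it out on its own with getString:

val message = results.getString("message")

You should get:

message: String = s2 is greater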

Here’s another really handy part. You can load in your data, run one short script, and get a whole table of standard statistical measures for each feature!

To do this, let’s first get our data into Spark and run a SystemML script.

val habermanUrl = "https://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data"
val habermanList = scala.io.Source.fromURL(habermanUrl).mkString.split("\n")
val habermanRDD = sc.parallelize(habermanList)
val habermanMetadata = new MatrixMetadata(306, 4)
val typesRDD = sc.parallelize(Array("1.0,1.0,1.0,2.0"))
val typesMetadata = new MatrixMetadata(1, 4)
val scriptUrl = "https://raw.githubusercontent.com/apache/systemml/master/scripts/algorithms/Univar-Stats.dml"
val script = dmlFromUrl(scriptUrl).in("A", habermanRDD, habermanMetadata).in("K", typesRDD, typesMetadata).in("$CONSOLE_OUTPUT", true).out("baseStats")

val results = ml.execute(script)

You should see:

    Feature [1]: Scale
     (01) Minimum             | 30.0
     (02) Maximum             | 83.0
     (03) Range               | 53.0
     (04) Mean                | 52.45751633986928
     (05) Variance            | 116.71458266366658
     (06) Std deviation       | 10.803452349303281
     (07) Std err of mean     | 0.6175922641866753
     (08) Coeff of variation  | 0.20594669940735139
     (09) Skewness            | 0.1450718616532357
     (10) Kurtosis            | -0.6150152487211726
     (11) Std err of skewness | 0.13934809593495995
     (12) Std err of kurtosis | 0.277810485320835
     (13) Median              | 52.0
     (14) Interquartile mean  | 52.16013071895425
-------------------------------------------------
    Feature [2]: Scale
     (01) Minimum             | 58.0
     (02) Maximum             | 69.0
     (03) Range               | 11.0
     (04) Mean                | 62.85294117647059
     (05) Variance            | 10.558630665380907
     (06) Std deviation       | 3.2494046632238507
     (07) Std err of mean     | 0.18575610076612029
     (08) Coeff of variation  | 0.051698529971741194
     (09) Skewness            | 0.07798443581479181
     (10) Kurtosis            | -1.1324380182967442
     (11) Std err of skewness | 0.13934809593495995
     (12) Std err of kurtosis | 0.277810485320835
     (13) Median              | 63.0
     (14) Interquartile mean  | 62.80392156862745
-------------------------------------------------
    Feature [3]: Scale
     (01) Minimum             | 0.0
     (02) Maximum             | 52.0
     (03) Range               | 52.0
     (04) Mean                | 4.026143790849673
     (05) Variance            | 51.691117539912135
     (06) Std deviation       | 7.189653506248555
     (07) Std err of mean     | 0.41100513466216837
     (08) Coeff of variation  | 1.7857418611299172
     (09) Skewness            | 2.954633471088322
     (10) Kurtosis            | 11.425776549251449
     (11) Std err of skewness | 0.13934809593495995
     (12) Std err of kurtosis | 0.277810485320835
     (13) Median              | 1.0
     (14) Interquartile mean  | 1.2483660130718954
-------------------------------------------------
    Feature [4]: Categorical (Nominal)
     (15) Num of categories   | 2
     (16) Mode                | 1
     (17) Num of modes        | 1
    results: org.apache.sysml.api.mlcontext.MLResults =
    [1] (Matrix) baseStats: Matrix: scratch_space/_p5250_9.31.116.229/parfor/2_resultmerge1, [17 x 4, nnz=44, blocks (1000 x 1000)], binaryblock, dirty

You can also ask for the base stats as a SystemML Matrix.

val baseStats = results.getMatrix("baseStats")

You should get:

baseStats: org.apache.sysml.api.mlcontext.Matrix = org.apache.sysml.api.mlcontext.Matrix@237cd4e5

With that matrix, you can convert it into a bunch of different things! Type baseStats. (with the trailing dot) and press Tab in the shell to see your options:

baseStats.
asDataFrame          asDoubleMatrix       asInstanceOf         asJavaRDDStringCSV   asJavaRDDStringIJV   asMLMatrix           asMatrixObject       asRDDStringCSV
asRDDStringIJV       isInstanceOf         toString
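
For example, asDataFrame (from the list above) hands the stats back as a Spark DataFrame, which is handy if you want to keep going with Spark’s own APIs. A quick sketch:

val baseStatsDF = baseStats.asDataFrame
baseStatsDF.show(5)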

You can also get the base stats as an RDD of strings in either CSV or IJV format. CSV gives you one matrix row per line, while IJV gives you one cell per line as a row index, column index, value triple. Here’s an example of both:

baseStats.toRDDString

Press Tab and you should get these options:

toRDDStringCSV   toRDDStringIJV

Try it as CSV:

baseStats.toRDDStringCSV.collect

You should get:

res4: Array[String] = Array(30.0,58.0,0.0,0.0, 83.0,69.0,52.0,0.0, 53.0,11.0,52....1.0)

Try it as IJV:

baseStats.toRDDStringIJV.collect

You should get:

res5: Array[String] = Array(1 1 30.0, 1 2 58.0, 1 3 0.0, 1 4 0.0, 2 1 83.0, 2 2 69.0, 2 3 52.0, 2 4 0.0, ... 1...
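
If you’d rather have plain Scala values, asDoubleMatrix (also from the list above) should give you the stats back as a two-dimensional array of doubles. Assuming the 17 x 4 layout shown in the results earlier (17 statistics by 4 features), the mean of the first feature would live at row 4, column 1:

val stats = baseStats.asDoubleMatrix
stats(3)(0)   // mean of feature 1; should be about 52.46 given the output above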

This is a great start to using SystemML with the Spark shell on the IBM Analytics Engine! Once you’re done, you can quit to exit:

:quit

Summary

Congratulations!! You have successfully set up your computer to run SystemML and Spark on the IBM Analytics Engine (IAE), loaded the Spark shell, run scripts, loaded data, and worked through some examples!! Now go save the world with these awesome new skills.