Learn how to use the Apache® Spark™ Machine Learning Library (MLlib) with IBM Analytics for Apache Spark in IBM Watson Studio. Apache® Spark™ includes extension libraries for SQL and DataFrames, streaming, machine learning, and graph analysis. In this video, you’ll see how to use machine learning to determine the top drop-off location for New York City taxis using a popular clustering algorithm known as KMeans.

Try the tutorial

Learn how to use Apache® Spark™ machine learning algorithms to determine the top drop-off location for New York City taxis using the KMeans algorithm.

Procedure 1: Download New York City taxi cab data

  1. Navigate to the NYC OpenData site.
  2. For the search criteria, type taxi.
  3. Select the trip data of your choice, and download the data in CSV format. We recommend you select the 2013_Green_Taxi_Trip_data.csv file, or change the code found later in this tutorial to match the selected year.

Procedure 2: Create a Scala notebook

  1. Sign in to IBM Watson Studio, and if you don’t have a project yet for Jupyter notebooks, watch this video to learn how.
  2. From the menu, access My Projects, and open an existing project.
  3. Click Add Notebook, select Scala and Spark 2.0, type a name for the notebook, and click Create.
  4. Paste the following code into the first cell in the notebook, and then click the Run icon on the toolbar. This first cell contains the import statements for the Apache® Spark™ machine learning classes used in this tutorial: KMeans, Vectors, and VectorAssembler.
    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.ml.feature.VectorAssembler
  5. In the Files slide out panel, drag and drop the CSV file you downloaded in procedure 1 into the box labelled Drop your file here.
  6. Next to the uploaded file, click Insert to code, then select Insert SparkSession DataFrame. This inserts code that uses your object storage credentials to read the contents of the file, assign it to a variable, and display the first 5 rows. Rename the inserted variable (for example, dfData1) to taxifile so the rest of the code will work correctly. If the generated code does not already register the DataFrame as a temporary view, also add taxifile.createOrReplaceTempView("taxifile") so the SQL query in step 7 can reference it. Click Run.

    When the results display, you’ll see that the first row contains the column headers, including dropoff_latitude and dropoff_longitude, and the subsequent rows contain the trip data itself.
  7. Paste the following code into the next cell, and then click Run. This query casts the coordinate columns to double (the CSV loads every column as a string, which the vector assembly in the next step cannot handle) and filters the data to drop-off latitudes and longitudes that are roughly in the Manhattan area.
    val taxifence = spark.sql("""select cast(Dropoff_latitude as double) as Dropoff_latitude,
    cast(Dropoff_longitude as double) as Dropoff_longitude from taxifile where
    Dropoff_latitude > 40.70 and
    Dropoff_latitude < 40.86 and
    Dropoff_longitude > -74.02 and
    Dropoff_longitude < -73.93""")
  8. Paste the following code into the next cell, and then click Run. This command assembles the two coordinate columns into a single feature vector, which is the input format the KMeans algorithm expects.
    val assembler = new VectorAssembler()
    .setInputCols(Array("Dropoff_latitude", "Dropoff_longitude"))
    .setOutputCol("features")
    val taxivector = assembler.transform(taxifence)
    val taxifeat = taxivector.drop("Dropoff_latitude","Dropoff_longitude")
  9. Paste the following code into the next cell, and then click Run. This final cell invokes the KMeans algorithm and prints the cluster centers. In this case, we're looking for the top drop-off locations; you can change the value passed to setK in this cell to determine the top three or the top ten locations instead. It's also worth noting that Apache® Spark™ machine learning provides other algorithms for collaborative filtering, clustering, and classification.
    val kmeans = new KMeans().setK(2).setSeed(1L)
    val model = kmeans.fit(taxifeat)
    println("Cluster Centers: ")
    model.clusterCenters.foreach(println)

Select and copy the coordinates. Then, open a browser, and paste the coordinates into a map program such as Google Maps to see the location on the map.
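The notebook steps above can be put together as a single standalone sketch. This is a minimal, hedged example, assuming Spark 2.x running locally and a local copy of the downloaded CSV; the file path, app name, and object name are placeholders, not part of the original tutorial. It also shows one way to decide which cluster center is the "top" drop-off location: count the points assigned to each cluster and take the most populated one.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

object TaxiDropoffs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TaxiDropoffs")
      .master("local[*]")           // local mode for a standalone run
      .getOrCreate()

    // Read the CSV with a header row; the path is a placeholder for your download
    val taxifile = spark.read
      .option("header", "true")
      .csv("2013_Green_Taxi_Trip_data.csv")
    taxifile.createOrReplaceTempView("taxifile")

    // Cast to double (the CSV loads every column as a string) and keep
    // coordinates that are roughly in the Manhattan area
    val taxifence = spark.sql("""select cast(Dropoff_latitude as double) as Dropoff_latitude,
      cast(Dropoff_longitude as double) as Dropoff_longitude from taxifile where
      Dropoff_latitude > 40.70 and Dropoff_latitude < 40.86 and
      Dropoff_longitude > -74.02 and Dropoff_longitude < -73.93""")

    // Assemble the two coordinate columns into the "features" vector KMeans expects
    val assembler = new VectorAssembler()
      .setInputCols(Array("Dropoff_latitude", "Dropoff_longitude"))
      .setOutputCol("features")
    val taxifeat = assembler.transform(taxifence)
      .drop("Dropoff_latitude", "Dropoff_longitude")

    // Fit KMeans; raise setK to find the top three or top ten locations instead
    val model = new KMeans().setK(2).setSeed(1L).fit(taxifeat)
    println("Cluster Centers: ")
    model.clusterCenters.foreach(println)

    // The "top" drop-off location is the center of the most populated cluster
    model.transform(taxifeat)
      .groupBy("prediction")
      .count()
      .orderBy(desc("count"))
      .show()

    spark.stop()
  }
}
```

Because the cluster centers are printed as (latitude, longitude) pairs, the first center in the most populated cluster can be pasted directly into a map program as described above.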

16 comments on "Use the Spark Machine Learning Library"

  1. Hello,

    I can’t find the “Insert Spark RDD” option, just the “Insert SparkSession Dataframe” and “Insert Credentials”. Can someone help?

    • You should use Insert SparkSession Dataframe. Step 6 was updated to reflect that change.

  2. Cesar Eduardo da Silva March 23, 2018

    When I paste the code required on Step 7, the notebook returns the following error:
    Name: Compile Error
    Message: :29: error: not found: value taxifile
    val taxidata=taxifile.filter(_.contains(“2013”)).

    What could have happened?

    • Step 6 was recently updated which indicates to use a SparkSession Dataframe and change the variable name to taxifile before running the cell. Step 7 will work after this change.

  3. Ian Wakley April 17, 2018

    How do I get the Scala and Spark 2.0 options in a new project? All I get is Python 3.5 and Default Anaconda or Default R environments.

    • Ian Wakley April 17, 2018

      Ok managed to solve this issue myself.
      I needed to add the Spark service to the project on the project settings tab.
      Then I get the scala and spark options when adding a new notebook.

  4. On step 7 I get the following error:
    Name: Compile Error
    Message: :41: error: value contains is not a member of org.apache.spark.sql.Row
    val taxidata=taxifile.filter(_.contains(“2013”)).

    • Hi John. There were some significant changes to this notebook that are reflected in the video, and just updated in the tutorial section. Please try running the new code above.

  5. Eric Michiels February 13, 2019

    Under number 6 I read: Rename the variable to taxifile so the rest of the code will work correctly. Click Run.
    1. Which Variable do you mean?
    2. When I click Run, nothing happens

    • Eric, when you use the Insert to code option, a variable named something like dfData1 will be inserted. That’s the variable you should rename to taxifile before you run the cell, and then continue with the rest of the steps below.

  6. I get the following message when I perform step#8:

    Name: java.lang.IllegalArgumentException
    Message: Data type StringType of column Dropoff_latitude is not supported.
    StackTrace: at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:124)
    at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
    at org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:54)

    Do I need to do something different in step#8? Kindly advise, thank you!

  7. Shubhajit Paul March 27, 2019

    Can you please tell me how can I Go through Step 6? I am not getting Insert to Code Option? Where I can find this option?

  8. If you are interested in the Python Version of this, I created one here:
