Learn how to use the Apache® Spark™ Machine Learning Library (MLlib) in IBM Analytics for Apache Spark on IBM Cloud. Apache® Spark™ includes extension libraries for SQL and DataFrames, streaming, machine learning, and graph analysis. In this video, you’ll see how to use the KMeans clustering algorithm from MLlib to determine the top drop-off location for New York City taxis.
Try the tutorial
Learn how to use Apache® Spark™ machine learning to determine the top drop-off location for New York City taxis with the KMeans algorithm.
Before you begin
Watch the Getting Started on Cloud video to create an IBM Cloud account and add the IBM Analytics for Apache Spark service.
Procedure 1: Download New York City taxi cab data
- Navigate to the NYC OpenData site.
- Click Transportation.
- For the search criteria, type taxi.
- Select the trip data of your choice, and download the data in CSV format. We recommend you select the 2013_Green_Taxi_Trip_data.csv file, or change the code found later in this tutorial to match the selected year.
Procedure 2: Create a Scala notebook
- Sign in to IBM Data Science Experience.
- From the menu, access My Projects, and open an existing project.
- Click Add Notebook, select Scala and Spark 2.0, type a name for the notebook, and click Create.
- Paste the following code into the first cell in the notebook, and then click the Run icon on the toolbar. This first cell contains two commands that set up the notebook to use KMeans, the Apache® Spark™ machine learning algorithm, and Vectors, the helper used to build its input.
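A likely form of this first cell is sketched here. It assumes the RDD-based MLlib package (`org.apache.spark.mllib`), which fits the Insert Spark RDD step used later; the exact package is an assumption, not something this page confirms.

```scala
// Assumption: the RDD-based MLlib API, matching the RDD workflow in this tutorial.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
```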
- In the Files slide-out panel, drag and drop the CSV file you downloaded in procedure 1 into the box labelled Drop your file here.
- Next to the uploaded file, click Insert to code, then select Insert Spark RDD. The inserted code uses your object storage credentials to read the contents of the file into the taxifile variable, and then displays the first five rows. Click Run.
When the results display, the first row is the column header and the remaining rows contain the trip data. In the header row, notice the dropoff_latitude and dropoff_longitude columns.
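Because the later cells refer to columns by position, it can help to look up where the two drop-off columns sit in the header. The following is a minimal sketch in plain Scala; the header string is hypothetical and only illustrates the lookup, so substitute the header row from your own file.

```scala
// Hypothetical header row; the real one is the first line of your CSV file.
val header = "vendor_id,pickup_datetime,dropoff_datetime,passenger_count,dropoff_latitude,dropoff_longitude"

// Map each column name to its zero-based position.
val columnIndex: Map[String, Int] =
  header.split(",").map(_.trim.toLowerCase).zipWithIndex.toMap

val latIndex = columnIndex("dropoff_latitude")
val lonIndex = columnIndex("dropoff_longitude")
println(s"dropoff_latitude is column $latIndex, dropoff_longitude is column $lonIndex")
```

In the notebook, the header could instead be read from the data itself, for example with `taxifile.first()`, assuming taxifile holds one line of text per record.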
- Paste the following code into the fourth cell. This command filters the data so that only records from 2013 remain, and drops any record whose dropoff_latitude or dropoff_longitude is empty. If you downloaded a different data set, the column numbers may be different.
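// Assumed first line (missing from this listing): keep only rows that mention 2013.
val taxidata = taxifile.filter(_.contains("2013")).
  filter(_.split(",")(4) != "").
  filter(_.split(",")(5) != "")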
- Paste the following code into the fifth cell, and then click Run. This keeps only the records whose drop-off latitude and longitude fall roughly within the Manhattan area.
val taxifence = taxidata.filter(_.split(",")(4).toDouble>40.70).
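The snippet above shows only the lower latitude bound. A complete bounding-box check might look like the following sketch. Only the 40.70 value comes from this tutorial; the upper latitude bound and both longitude bounds are illustrative assumptions for a rough Manhattan box, and column 5 is assumed to hold the longitude. The name `inFence` is hypothetical.

```scala
// Column positions follow the tutorial's snippets: 4 = dropoff_latitude, 5 = dropoff_longitude (assumed).
val latCol = 4
val lonCol = 5

// Bounding box: only minLat (40.70) comes from the tutorial; the other three values are assumptions.
val minLat = 40.70
val maxLat = 40.86
val minLon = -74.02
val maxLon = -73.93

// True when the row's drop-off point lies strictly inside the box.
// Rows with missing or non-numeric coordinates yield false instead of throwing.
def inFence(line: String): Boolean = {
  val cols = line.split(",")
  scala.util.Try {
    val lat = cols(latCol).toDouble
    val lon = cols(lonCol).toDouble
    lat > minLat && lat < maxLat && lon > minLon && lon < maxLon
  }.getOrElse(false)
}
```

In the notebook, this predicate could be applied as `val taxifence = taxidata.filter(inFence)`. Wrapping the parse in `Try` is a design choice: a single malformed row is skipped rather than failing the whole job.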
- Paste the following code into the sixth cell, and then click Run. This command converts the filtered data into vectors, which are the input format for the KMeans algorithm.
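One way this conversion might look is sketched here, assuming the RDD-based MLlib `Vectors` helper and that columns 4 and 5 hold the drop-off latitude and longitude, as in the earlier filters. The function name `toPoint` and the sample row are hypothetical.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Build a two-element dense vector (latitude, longitude) from one CSV row.
// Assumes columns 4 and 5 hold the drop-off latitude and longitude.
def toPoint(line: String): Vector =
  Vectors.dense(line.split(",").slice(4, 6).map(_.toDouble))

// Hypothetical sample row, used only to show the shape of the result.
val sample = toPoint("2,2013-09-01 00:02:00,2013-09-01 00:10:00,N,40.75,-73.98")
println(sample)
```

In the notebook, the same function could be mapped over the filtered data, for example `val taxi = taxifence.map(toPoint)`; the variable name `taxi` is an assumption.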
- Paste the following code into the seventh cell, and then click Run. This final cell contains commands to invoke the KMeans algorithm. In this case, we’re looking for the top drop-off location; however, the parameters in this cell could be changed to determine the top three or the top ten locations. Apache® Spark™ machine learning also provides other algorithms for collaborative filtering, clustering, and classification.
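A minimal sketch of such a cell follows, assuming the RDD-based `KMeans.train(data, k, maxIterations)` call from MLlib. The function name `dropoffCenters` and its parameter names are hypothetical. The cluster centers it returns are what this tutorial treats as the top drop-off locations.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Cluster the points and return the centers; each center is a (latitude, longitude) vector.
// clusterCount = 1 yields a single location; 3 or 10 would yield that many centers.
def dropoffCenters(points: RDD[Vector], clusterCount: Int, iterationCount: Int): Array[Vector] = {
  val model = KMeans.train(points, clusterCount, iterationCount)
  model.clusterCenters
}
```

Assuming the vectors from the previous cell are stored in a variable named `taxi`, the coordinates could be printed with `dropoffCenters(taxi, 1, 10).foreach(c => println(s"${c(0)},${c(1)}"))`. The iteration count of 10 is an illustrative value.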
Select and copy the coordinates. Then, open a browser, and paste the coordinates into a map program such as Google Maps to see the location on the map.