Big R provides an end-to-end integration of R within IBM InfoSphere BigInsights. This makes it easy to write and execute R programs that operate on data stored in a Hadoop cluster.

Using Big R, an R user can explore, transform, and analyze big data hosted in a BigInsights cluster using familiar R syntax and paradigm. All of these capabilities are accessible from a standard R client.

Big R provides the following capabilities:

  1. Enable the use of R as a query language for big data: Big R hides many of the complexities pertaining to the underlying Hadoop / MapReduce framework. Using classes such as bigr.frame, bigr.vector and bigr.list, a user is presented with an API that is heavily inspired by R’s foundational API on data.frames, vectors and lists.
  2. Enable the pushdown of R functions such that they run right on the data: Via mechanisms such as groupApply, rowApply and tableApply, user-written functions composed in R can be shipped to the cluster. BigInsights transparently parallelizes execution of these functions and provides consolidated results back to the user. Almost any R code, including most packages available on open-source repositories such as CRAN (Comprehensive R Archive Network), can be run using this mechanism.

Part 1: Getting Started with Big R

During this tutorial, you’ll learn how to:

  • Connect to BigInsights from R
  • Work with Big R objects such as bigr.frames, bigr.vectors and bigr.lists
  • Run queries and aggregations on big data from R
  • Mix R and Big R constructs to create powerful visualizations
  • Perform in-database analytics with R

The following exercises are also available in a file named “BigRLab4.R” in the “/home/biadmin/labs-bigr/” directory. You can open this file in RStudio and follow along. When practical, it is suggested that you type in the code in the “Console” to see how Big R operates. For longer batches of statements, you can certainly copy-and-paste the statements into the RStudio console and execute them.

__1.            Start BigInsights

From the Desktop, click on the “Start BigInsights” icon.

image001

This action starts various BigInsights components, including HDFS and MapReduce, on your machine. A Terminal window pops up to show progress and eventually disappears. Once it does, return to RStudio.

image002

__2.            Back to RStudio

Return to the RStudio session in the browser.

[code language=”r”]>setwd("~/labs-bigr")     # change directory
>rm(list = ls())          # clear workspace[/code]

Load Big R

Load the Big R package into your R session.

[code language=”r”]>library(bigr)[/code]

__3.            Connect to BigInsights

Note how one needs to specify the right credentials for this call.

[code language=”r”]>bigr.connect(host="localhost", port=7052, user="biadmin", password="biadmin")[/code]

Verify that the connection was successful.

[code language=”r”]>is.bigr.connected()[/code]
[1] TRUE

If you ever lose your connection during these exercises, run the following line to reconnect. You can also invoke bigr.connect() as we did above.

[code language=”r”]>bigr.reconnect()[/code]
[1] TRUE

__4.            Browse files on HDFS

Once connected, you will be able to browse the HDFS file system and examine datasets that have already been loaded onto the cluster.

[code language=”r”]>bigr.listfs()         # List files under root "/"[/code]
          path   owner length blockSize permission
1 /biginsights biadmin      0         0  rwxr-xr-x
2      /hadoop biadmin      0         0  rwxr-xr-x
3       /hbase biadmin      0         0  rwxr-xr-x
4         /tmp biadmin      0         0  rwxrwxrwx
5        /user biadmin      0         0  rwxrwxrwx

[code language=”r”]>bigr.listfs("/user/biadmin")     # List files under /user/biadmin[/code]
                           path   owner   length blockSize permission
1 /user/biadmin/airline_lab.csv biadmin 12055574 134217728  rw-r--r--
2       /user/biadmin/credstore biadmin        0         0  rwx--x--x

Part 2 – Using R as a query language for Big Data

The following exercises are available in a file named “BigRLab5.R” in the “/home/biadmin/labs-bigr/” directory.

__1.            Connect to a big data set

“/user/biadmin/airline_lab.csv” is one of the datasets that you will see on HDFS. This is a comma-delimited file (type = “DEL”). Let’s connect to it and explore it a bit. This is done by creating a bigr.frame over the dataset. A bigr.frame is an R object that mimics R’s own data.frame. However, unlike a data.frame, a bigr.frame does not load the data into memory, as that would be impractical for big data. The data stays in HDFS, yet you can still explore it using the Big R API.

[code language=”r”]>air <- bigr.frame(dataSource="DEL",

>dataPath="/user/biadmin/airline_lab.csv")[/code]

Check that “air” is an object of class “bigr.frame”.

[code language=”r”]>class(air)[/code]
[1] "bigr.frame"
attr(,"package")
[1] "bigr"

Basic exploration

__2.            Exploring table metadata

Examine the structure of the dataset. Note that the output looks very similar to R’s data.frames. The dataset has 29 variables (i.e. columns). The first few values of each column are also shown. Examine the columns and see what they may possibly represent.

[code language=”r”]>str(air)[/code]
'bigr.frame': 29 variables:

$ Year             : chr "2004" "2004" "2004" "2004" "2004" "2004"

$ Month           : chr "2" "2" "2" "2" "2" "2"

$ DayofMonth       : chr "12" "16" "18" "19" "21" "24"

$ DayOfWeek       : chr "4" "1" "3" "4" "6" "2"

$ DepTime         : chr "633" "2115" "700" "1140" "936" "1117"

$ CRSDepTime       : chr "635" "2120" "700" "1145" "935" "1120"

$ ArrTime         : chr "935" "2340" "817" "1427" "1036" "1922"

$ CRSArrTime       : chr "930" "2350" "820" "1420" "1035" "1930"

$ UniqueCarrier   : chr "B6" "B6" "B6" "B6" "B6" "B6"

$ FlightNum       : chr "165" "199" "2" "67" "68" "206"

$ TailNum         : chr "N553JB" "N570JB" "N544JB" "N570JB" "N544JB" "N548JB"

$ ActualElapsedTime: chr "182" "325" "77" "167" "60" "305"

$ CRSElapsedTime   : chr "175" "330" "80" "155" "60" "310"

$ AirTime         : chr "162" "114" "49" "141" "41" "468"

$ ArrDelay         : chr "5" "-10" "-3" "7" "1" "-8"

$ DepDelay         : chr "-2" "-5" "0" "-5" "1" "-3"

$ Origin           : chr "JFK" "JFK" "JFK" "RSW" "JFK" "LGB"

$ Dest             : chr "TPA" "LAS" "BUF" "JFK" "SYR" "JFK"

$ Distance         : chr "1005" "2248" "301" "1074" "209" "2465"

$ TaxiIn           : chr "3" "8" "2" "7" "3" "7"

$ TaxiOut         : chr "17" "23" "26" "19" "16" "10"

$ Cancelled       : chr "0" "0" "0" "0" "0" "0"

$ CancellationCode : chr "NA" "NA" "NA" "NA" "NA" "NA"

$ Diverted         : chr "0" "0" "0" "0" "0" "0"

$ CarrierDelay     : chr "0" "0" "0" "0" "0" "0"

$ WeatherDelay     : chr "0" "0" "0" "0" "0" "0"

$ NASDelay         : chr "0" "0" "0" "0" "0" "0"

$ SecurityDelay   : chr "0" "0" "0" "0" "0" "0"

$ LateAircraftDelay: chr "0" "0" "0" "0" "0" "0"

__3.            Assign column types

Notice that the column types are all “character” (abbreviated as “chr”). Unless specified otherwise, Big R automatically assumes all data to be strings. However, we know that only columns Year (1), Month (2), UniqueCarrier (9), TailNum (11), Origin (17), Dest (18), CancellationCode (23) are strings, while the rest are numbers. Let us assign the correct column types.

First, we build a vector that holds the column types for all columns

[code language=”r”]>ct <- ifelse(1:29 %in% c(1,2,9,11,17,18,23),

>"character", "integer")

>print(ct)[/code]

[1] "character" "character" "integer"   "integer"   "integer"   "integer"

[7] "integer"   "integer"   "character" "integer"   "character" "integer"

[13] "integer"   "integer"   "integer"   "integer"   "character" "character"

[19] "integer"   "integer"   "integer"   "integer"   "character" "integer"

[25] "integer"   "integer"   "integer"   "integer"   "integer"
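The same ifelse() / %in% idiom runs unchanged in plain R, so it can be checked without a cluster. The sketch below is illustrative only; `char_cols` is a name introduced here for readability and is not part of Big R.

```r
# Positions of the columns that should stay as strings (taken from the step above)
char_cols <- c(1, 2, 9, 11, 17, 18, 23)

# ifelse() is vectorized: each of the 29 positions is tested against char_cols
ct <- ifelse(1:29 %in% char_cols, "character", "integer")

# Tally how many columns fall into each type
print(table(ct))
```

Building the vector from positions keeps the call short, at the cost of having to recount if the column order ever changes.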

Assign the column types

[code language=”r”]>coltypes(air) <- ct[/code]

__4.            Data dimensions

This data originally comes from the US Department of Transportation (http://www.rita.dot.gov), and it provides information on every US flight over the past couple of decades. The original data has more than 125 million rows. For this lab, we’re only using a small sample. Let us examine the dimensions of the dataset.

[code language=”r”]>nrow(air)     # Number of rows (i.e. flights) in the data[/code]
[1] 128790
[code language=”r”]>ncol(air)     # Number of attributes recorded for each flight[/code]
[1] 29

Data summarization

__5.            Summarizing frames

Let us summarize some key columns to gain further understanding of this data. You’ll see that the years range from 1987 to 2008.

[code language=”r”]>summary(air[, c("Year", "Month", "UniqueCarrier")])[/code]
     Year        Month     UniqueCarrier
A Min. :1987   Min. :1     Min. :9E
B Max. :2008   Max. :9     Max. :YV

__6.            Summarizing vectors and basic visualization

Summarizing columns one by one will give us additional information. In some cases, we will also visualize the information.

The following statement shows us the distribution of flights by year. Again, we have 22 years’ worth of data. What you will see is a named vector, with the year as each name and the flight count as each value.

[code language=”r”]>summary(air$Year)[/code]
1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002
1364 5408 5282 5518 5279 5300 5325 5394 5532 5560 5635 5614 5748 5915 6165 5475
2003 2004 2005 2006 2007 2008
6810 7455 7463 7469 7753 7326

Let us combine Big R’s summary() with R’s visualization capabilities to see the same data distribution graphically. Before you do this, make sure the “Plots” window (lower right-hand pane of RStudio) is in view, and that it is sized big enough to display the plot. We see that the flight volume has gradually increased over the years. Feel free to “Zoom” the plot to pop up a separate window. To close the popped-up window, press Ctrl+F4.

image003

[code language=”r”]>barplot(summary(air$Year))               # Visualize it![/code]

Similarly, we can also examine the distribution of flights by airline (UniqueCarrier). We have 29 airlines in this dataset, including United (UA), Delta (DL), and many others.

[code language=”r”]>summary(air$UniqueCarrier)[/code]
    9E     AA     AQ     AS     B6     CO     DH     DL     EA
   526  15641    146   2981    877   8394    710  17397    952
    EV     F9     FL     HA     HP ML (1)     MQ     NW     OH
  1778    371   1289    269   3839     76   4040  10630   1547
    OO PA (1)     PI     PS     TW     TZ     UA     US     WN
  3270    346    900     92   4019    240  13872  14751  16464
    XE     YV
  2474    899

Again, visualizing the data will give us a better perspective. We will use Big R to aggregate the data, sort it, and then plot it. Can you tell which airlines have the most flights? Which ones have the fewest? Does this match your own experience? Again, you can “Zoom” the plot window for a better view, and close it when done.

[code language=”r”]>barplot(sort(summary(air$UniqueCarrier)))[/code]

image004
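For comparison, base R’s table() produces the same kind of named count vector from a local character vector, and sort() orders it before plotting. This is a rough local analogue on made-up carrier codes, not a description of how Big R computes summary() on the cluster.

```r
# Toy carrier codes standing in for air$UniqueCarrier (made-up values)
carriers <- c("AA", "DL", "AA", "WN", "DL", "AA", "HA")

# table() returns counts named by carrier code
counts <- table(carriers)

# sort() orders the counts from smallest to largest, ready for barplot()
sorted_counts <- sort(counts)
print(sorted_counts)

# Uncomment in an interactive session to draw the chart
# barplot(sorted_counts)
```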

Try summarizing some of the other columns (i.e. vectors) such as Distance, Origin, Dest, etc. What does this tell you?

[code language=”r”]>summary(air$Distance)[/code]
    Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
 11.0000  307.0000  545.0000  701.9441  936.0000 4983.0000

__7.            Attaching bigr.frame to R search path

Notice how we used the expression “air$” to reference various columns. Similar to R’s data.frames, you can attach a bigr.frame (‘air’) to R’s search path. This will make it easy for us to reference columns without requiring a prefix.

[code language=”r”]>attach(air)[/code]

Slicing and dicing big data

__8.            Drilling down

Over the next few exercises, let’s drill down into the data and ask specific questions. These exercises will demonstrate that the Big R API on bigr.frames closely mirrors the R API on data.frames.

Of the 128790 flights in all, how many were flown by American, Southwest and Delta?

[code language=”r”]>length(UniqueCarrier[UniqueCarrier %in% c("AA", "WN", "DL")])[/code]
[1] 49502

How many flights were delayed by 15 minutes or more on departure?

[code language=”r”]>length(DepDelay[DepDelay >= 15])[/code]
[1] 20537

How many flights flew between San Francisco and LA?

[code language=”r”]>nrow(air[Origin %in% c("SFO", "LAX")

>& Dest %in% c("SFO", "LAX"),])[/code]

[1] 701
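Because these expressions mirror data.frame syntax, the same three questions can be tried on an ordinary data.frame. The sketch below uses a tiny made-up sample (the `flights` data.frame and its values are hypothetical) and counts with sum() over a logical vector, which is a common base R alternative to length() on a subset.

```r
# Small made-up sample standing in for the airline data
flights <- data.frame(
  UniqueCarrier = c("AA", "WN", "UA", "DL", "HA"),
  DepDelay      = c(20L, 3L, 15L, -2L, 40L),
  Origin        = c("SFO", "LAX", "JFK", "LAX", "HNL"),
  Dest          = c("LAX", "SFO", "SFO", "PHX", "OGG"),
  stringsAsFactors = FALSE
)

# Flights flown by American, Southwest or Delta
n_big3 <- sum(flights$UniqueCarrier %in% c("AA", "WN", "DL"))

# Flights that departed 15 or more minutes late
n_late <- sum(flights$DepDelay >= 15)

# Flights between SFO and LAX, in either direction
n_sfo_lax <- nrow(flights[flights$Origin %in% c("SFO", "LAX") &
                          flights$Dest %in% c("SFO", "LAX"), ])

print(c(big3 = n_big3, late = n_late, sfo_lax = n_sfo_lax))
```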

__9.            Detaching bigr.frame

Let us detach ‘air’ from R’s search path. After this step, we will need to specify a complete prefix to reference columns in a bigr.frame.

[code language=”r”]>detach(air)[/code]

__10.        Selections and projections

Let us filter the data set for flights that were not cancelled and were delayed by 15 minutes or more at departure or arrival. In addition, we will only project a few columns. We will call the new object “airSubset”. Note how this syntax is identical to an equivalent formulation on R’s data.frames. It looks and feels like we’re operating against data.frames, except we’re seamlessly going against data in BigInsights.

[code language=”r”]>airSubset <- air[air$Cancelled == 0

>& (air$DepDelay >= 15 | air$ArrDelay >= 15),

>c("UniqueCarrier", "Origin", "Dest",

>"DepDelay", "ArrDelay")][/code]

Examine the class of the newly created object. It is also a bigr.frame that’s been derived from the original “air” bigr.frame.

[code language=”r”]>class(airSubset)[/code]
[1] "bigr.frame"
attr(,"package")
[1] "bigr"

__11.        Operations on derived frames

Big R does not actually materialize the derived dataset on the BigInsights server. The selections and projections are performed “on the fly” against the original data. In the following exercises, we’ll use the “airSubset” to perform some queries.

Examine the dimensions of the new frame. That’s 29230 rows and 5 columns.

[code language=”r”]>dim(airSubset)[/code]
[1] 29230     5

Examine 5 rows. Note that either the arrival or the departure delay is 15+ minutes.

[code language=”r”]>head(airSubset, 5)[/code]
UniqueCarrier Origin Dest DepDelay ArrDelay

1          CO   EWR FLL       9       15

2           CO   EWR ATL       -5       50

3           CO   IAH SFO       0       35

4           CO   EWR SFO       87     132

5           CO   IAH OMA       28       98

What percentage of flights were delayed overall? 22.7% is the answer.

[code language=”r”]>nrow(airSubset) / nrow(air)[/code]
[1] 0.2269586

What percentage of Hawaiian Airlines flights were delayed? At 7.4%, the rate is much lower than the system-wide delay rate of 22.7%. The results are not surprising. The islands have shorter flights and good weather, and these factors probably lend themselves to a lower delay rate. Again, note that Big R expressions closely mirror equivalent R expressions.

[code language=”r”]>haFlightsDelayed <- airSubset[airSubset$UniqueCarrier %in%

>c("HA"),]

>haFlights <- air[air$UniqueCarrier == "HA",]

>nrow(haFlightsDelayed) / nrow(haFlights)[/code]

[1] 0.07434944

__12.        Sorting

Besides selections and projections, Big R supports other relational operations such as projecting derived columns, sorting, aggregations, joining and duplicate elimination. Some of these operations are covered in the later sections of this lab. For the moment, let us see how sorting works in Big R.

Which were the “longest” flights based on distance flown? Here, Big R generates another derived bigr.frame (“bf”) that holds the result of the sort. Again, the sort itself is performed on the fly and the results are not materialized unless so desired.

[code language=”r”]>bf <- bigr.sort(air, by = air$Distance, decreasing = TRUE)

>bf <- bf[,c("Origin", "Dest", "Distance")]

>class(bf)[/code]

[1] "bigr.frame"
attr(,"package")
[1] "bigr"

Examine the top 6 rows. Not surprisingly, flights from the east coast cities to Hawaii are the longest.

[code language=”r”]>head(bf)[/code]
Origin Dest Distance

1   HNL JFK     4983

2   EWR HNL     4962

3   HNL EWR     4962

4   EWR HNL     4962

5   HNL EWR     4962

6   EWR HNL     4962
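On a local data.frame, the closest base R counterpart to this sort-then-project step is order() combined with column selection. The sketch below assumes a small made-up `flights` data.frame; it illustrates the data.frame idiom only and says nothing about how bigr.sort() is implemented.

```r
# Made-up routes standing in for the airline data
flights <- data.frame(
  Origin   = c("JFK", "HNL", "SFO", "EWR"),
  Dest     = c("BUF", "JFK", "LAX", "HNL"),
  Distance = c(301L, 4983L, 337L, 4962L),
  stringsAsFactors = FALSE
)

# order() returns row positions sorted by Distance, largest first;
# the column vector then projects the three columns of interest
longest <- flights[order(flights$Distance, decreasing = TRUE),
                   c("Origin", "Dest", "Distance")]

print(head(longest, 2))
```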

__13.        Aggregations

In the earlier exercises, we summarized bigr.frames and bigr.vectors. Big R provides a more powerful mechanism to compute specific aggregates. Using R’s formula notation, one can specify columns to aggregate along with any grouping constructs. An R formula is an expression of type “LHS ~ RHS”. For our purposes, on the LHS (left-hand-side), we specify the columns we’re interested in, and what aggregation functions need to be computed on those columns. On the RHS (right-hand-side), we specify any grouping. To compute aggregates on the entire data, use a dot (.).

What’s the mean flying distance and mean scheduled flying time across all airlines? If you know SQL, the following query is equivalent to “select avg(Distance), avg(CRSElapsedTime) from air”. It tells us that the average flight covered about 702 miles and was scheduled to take about 2 hours.

[code language=”r”]>formula <- mean(Distance) + mean(CRSElapsedTime) ~ .

>summary(air, formula)[/code]

mean(Distance) mean(CRSElapsedTime)

1       701.9441             121.032

 

What is the number of flights, mean flying distance and mean flying time per airline? This yields a table of 29 airlines. The Distance is in miles, while the time is in minutes.

[code language=”r”]>summary(air, count(.) + mean(Distance) + mean(ActualElapsedTime)

>~ UniqueCarrier)[/code]

UniqueCarrier count(.) mean(Distance) mean(ActualElapsedTime)

1             B6     877     1196.9270               187.52098

2             MQ     4040      365.5876               84.24890

3             UA   13872       915.2260               144.93930

4             PS       92       376.5761               76.12088

5             HP     3839       762.0083               124.53314

6             DL   17397       706.1685               120.53761

7             FL     1289       665.8805               119.47402

8             NW   10630       710.4674               123.77893

9             EA     952       612.3038               106.34133

10           TZ      240     1118.9375               173.39916

11           AQ     146       269.5890               52.75352

12           OO     3270       385.0135               82.30714

13           F9     371       865.4960               141.65041

14           US   14751       580.0558               105.23018

15       ML (1)       76       683.2763               116.06849

16           TW     4019       724.5785               124.92718

17           XE     2474       530.9434               108.23721

18            9E     526       472.7357               100.70485

19       PA (1)     346       719.1826               129.79646

20           WN   16464       505.7515               87.84705

21           YV     899       397.1001               88.07365

22            DH     710       384.4845               88.75291

23           PI     900       386.4378               76.63229

24           EV     1778       447.4370               92.35731

25           CO     8394       897.9850               147.39600

26            OH     1547       474.2754               100.22416

27           HA     269       587.7398               92.83083

28           AA   15641       961.7579               153.78924

29           AS     2981       724.3296               119.70805
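Base R offers a related, though not identical, formula interface in aggregate(): the measures go on the left-hand side inside cbind(), the grouping column goes on the right, and the aggregation function is passed separately through FUN. The sketch below uses a made-up `flights` data.frame to show that local analogue; it is not the Big R summary() call.

```r
# Made-up flights standing in for the airline data
flights <- data.frame(
  UniqueCarrier  = c("AA", "AA", "HA", "HA", "UA"),
  Distance       = c(1000, 2000, 100, 300, 900),
  CRSElapsedTime = c(150, 250, 30, 50, 140),
  stringsAsFactors = FALSE
)

# Overall means: the local counterpart of using "." on the right-hand side
overall <- colMeans(flights[, c("Distance", "CRSElapsedTime")])

# Per-carrier means: cbind() lists the measures, the RHS names the grouping column
by_carrier <- aggregate(cbind(Distance, CRSElapsedTime) ~ UniqueCarrier,
                        data = flights, FUN = mean)

print(overall)
print(by_carrier)
```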

Visualizing big data

__14.        Visualizing data using box plots

Let’s use R’s visualization capabilities to plot the distribution of flying distance per airline. This plot provides us some of the same information we gathered a few minutes ago. The picture tells us that HA (Hawaiian) and AQ (Aloha) have the smallest median flying distance, which is expected because inter-island flights in Hawaii last only about 30 minutes.

[code language=”r”]>bigr.boxplot(air$Distance ~ air$UniqueCarrier) +

>labs(title = "Distance flown by Airline")[/code]

image005

__15.        Visualizing big data using heatmaps

How about producing a “heatmap” that shows how the flight volume was distributed across the calendar year? We’ll pick 3 years, say 2000, 2001 and 2002. But before we do so, let us load a plotting function that will produce the heatmap for us.

[code language=”r”]>source("calendarHeat.R")[/code]

Use Big R to filter the years we need.

[code language=”r”]>air2 <- air[air$Cancelled == 0

>& (air$Year == "2000" | air$Year == "2001" |

>air$Year == "2002"),][/code]

Summarize flight volume by Year, Month and Day

[code language=”r”]>df <- summary(air2, count(Month) ~ Year + Month + DayofMonth)[/code]

Build the “date” string as a separate column

[code language=”r”]>df$DateStr <- paste(df$Year, df$Month, df$DayofMonth, sep="-");[/code]

Plot the graph, and zoom it out. Do you notice any patterns? Any anomalies? Do you notice a white spot in the middle of the chart? Do you know why that spot is white?

[code language=”r”]>calendarHeat(df$DateStr, df[,4] * 100, varname="Flight Volume")[/code]

image006

__16.        Visualizing big data using histograms

Let’s examine the distribution of flights by hour. bigr.histogram() is a utility method that computes histogram statistics on BigInsights, and uses the R package “ggplot2” to render the plot. This plot shows the flight volume for every hour in the day. Notice how very few flights take off in the early morning hours. The majority of the flight volume comes between 7 AM and 7 PM, with the volume tapering off after 8 PM.

[code language=”r”]>bigr.histogram(air$DepTime, nbins=24) +

>labs(title = "Flight Volume (Departure) by Hour")[/code]

image007
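To see why 24 bins roughly correspond to hours of the day, note that DepTime is stored in hhmm form (for example, 633 means 6:33 AM). The sketch below bins a few made-up departure times by hour in plain R; whether bigr.histogram() places its bin edges exactly on the hour is an assumption here, not something the lab states.

```r
# Made-up departure times in hhmm form, as in the DepTime column
dep_time <- c(633L, 2115L, 700L, 1140L, 936L, 1117L, 705L)

# Integer division by 100 drops the minutes and leaves the hour (0-23)
dep_hour <- dep_time %/% 100L

# Count departures for each hour of the day, keeping empty hours as zeros
per_hour <- table(factor(dep_hour, levels = 0:23))

# Show only the hours that have at least one departure
print(per_hour[per_hour > 0])
```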

__17.        Visualizing big data using geographical maps

How about plotting the routes that have the most flights flown? First, let us load the functions that render a US map.

[code language=”r”]>source("mapRoutes.R")[/code]

To our original bigr.frame, “air”, add two columns that represent the city pairs.

[code language=”r”]>air2 <- air

>air2$City1 <- ifelse(air2$Origin < air2$Dest, air2$Origin, air2$Dest)

>air2$City2 <- ifelse(air2$Origin >= air2$Dest, air2$Origin, air2$Dest)[/code]

Compute the frequency of flights between all city pairs

[code language=”r”]>df <- summary(air2, count(UniqueCarrier) ~ City1 + City2)

>flights <- df[order(df[,3], decreasing=T), ][1:15,]

>print(flights)[/code]

City1 City2 count(UniqueCarrier)

1673   LAX   SFO                 701

1040   LAS   LAX                 605

1884   LAX   PHX                 584

2297   LAS   PHX                 509

1427   MSP   ORD                  507

2040   LGA   ORD                 475

2163   DAL   HOU                 459

695   DFW   ORD                 425

1549   EWR   ORD                 420

561   LAX   OAK                 406

920   BOS   LGA                 400

1869   ATL   ORD                  389

2083   DFW   IAH                 382

755   DCA   LGA                 378

1377   LAX   ORD                 369
[code language=”r”]>colnames(flights) <- c("airport1", "airport2", "cnt")[/code]
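The city-pair steps above treat a route and its reverse as the same pair by always putting the alphabetically smaller airport code first. The sketch below reproduces that normalization and the count on a tiny local data.frame; the sample rows and the helper column `cnt` are made up for illustration.

```r
# Made-up flights: three between SFO and LAX in mixed directions, one ORD to MSP
flights <- data.frame(
  Origin = c("SFO", "LAX", "LAX", "ORD", "SFO"),
  Dest   = c("LAX", "SFO", "SFO", "MSP", "LAX"),
  stringsAsFactors = FALSE
)

# Put the alphabetically smaller airport first so both directions share one key
flights$City1 <- ifelse(flights$Origin < flights$Dest, flights$Origin, flights$Dest)
flights$City2 <- ifelse(flights$Origin >= flights$Dest, flights$Origin, flights$Dest)

# Count flights per unordered pair, then sort by descending count
pairs <- aggregate(cnt ~ City1 + City2,
                   data = transform(flights, cnt = 1L), FUN = sum)
pairs <- pairs[order(pairs$cnt, decreasing = TRUE), ]

print(pairs)
```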

Plot the data! This plot works best if you maximize the size of the plot window before executing the following statement.

[code language=”r”]>mapRoutes(flights)[/code]

image008

__18.        Free experimentation

We’re at the end of this set of explorations. Use your knowledge of R and data.frames to formulate more queries against the “air” dataset. Alternatively, move to the next set of exercises in “BigRLab6.R”.

Part 3 – In-database analytics with Big R

The following exercises are available in a file named “BigRLab6.R” in the “/home/biadmin/labs-bigr/” directory.

__1.            Back to RStudio

Return to the RStudio session in the browser. When in RStudio, you can use the F11 key to go into “Full Screen” mode. This will maximize your viewing area. Use F11 again to go back to the original view.

From within RStudio, run the following statements.

[code language=”r”]>setwd("~/labs-bigr")                          # change directory

>rm(list = ls())                               # clear workspace

>library(bigr)                                 # Load Big R package

# Connect to BigInsights

>bigr.connect(host="localhost", port=7052, user="biadmin", password="biadmin")[/code]

Once again, define a bigr.frame over the “airline” dataset. Note how we can specify column types in this call.

[code language=”r”]>air <- bigr.frame(dataSource="DEL",

>dataPath="/user/biadmin/airline_lab.csv",

>coltypes=ifelse(1:29 %in% c(1,2,9,11,17,18,23),

>"character", "integer"))[/code]

Preparation for big data modeling

__2.            Preparation for modeling

Big R supports in-database analytics. This implies that instead of bringing large amounts of data to your R clients, Big R can ship R code to the BigInsights server and run it directly on the data.

To explore this concept, let’s build a simple set of “decision tree” models. Our models will predict flight arrival delay (ArrDelay) based on ‘DepDelay’, ‘DepTime’, ‘CRSArrTime’ and ‘Distance’ as the predictor variables. If this were a real-world exercise, the model would take many more variables into account. However, in the interest of time, we will limit our predictors to this small set.

To keep things simple, let us build the models for two airlines, say United and Hawaiian. Filter the airline data appropriately.

[code language=”r”]>airfilt <- air[air$UniqueCarrier %in% c("HA", "UA"),][/code]

__3.            Computing correlation matrix

Before we start building models, let’s see if there’s any correlation between the pairs of columns we’ve selected. We’ll use Big R’s cor() method to compute Pearson’s correlation coefficients. This function is identical to the equivalent function on data.frames.

[code language=”r”]>corr <- cor(airfilt[,c("ArrDelay", "DepDelay",

>"DepTime", "CRSArrTime", "Distance")])[/code]

Print the correlation matrix. It looks like ArrDelay is strongly correlated (0.911403248) with DepDelay, which is to be expected. In addition, departure time (DepTime) and scheduled arrival time (CRSArrTime) are also somewhat correlated (0.69462876). Interestingly, the Distance flown has almost no bearing on ArrDelay and DepDelay.

[code language=”r”]>print(corr)[/code]
ArrDelay DepDelay     DepTime CRSArrTime     Distance

ArrDelay   1.000000000 0.9114032 0.17311562 0.10959000 -0.005877765

DepDelay   0.911403248 1.0000000 0.18607005 0.11513735 0.011089301

DepTime     0.173115625 0.1860701 1.00000000 0.69462876 -0.018451079

CRSArrTime 0.109589998 0.1151373 0.69462876 1.00000000 0.033232059

Distance   -0.005877765 0.0110893 -0.01845108 0.03323206 1.000000000
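The same call shape works on a local data.frame with base R’s cor(), which computes Pearson coefficients by default. The sketch below builds synthetic columns (all values are simulated, not taken from the airline data) so that arrival delay tracks departure delay while distance is independent, which reproduces the pattern seen in the matrix above.

```r
set.seed(42)  # reproducible toy data

# Simulated departure delays, in minutes
dep_delay <- rnorm(200, mean = 5, sd = 20)

flights <- data.frame(
  DepDelay = dep_delay,
  ArrDelay = dep_delay + rnorm(200, sd = 5),    # arrival delay tracks departure delay
  Distance = runif(200, min = 100, max = 3000)  # drawn independently of the delays
)

# Pearson correlation between every pair of columns
corr <- cor(flights[, c("ArrDelay", "DepDelay", "Distance")])

print(round(corr, 3))
```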

__4.            Make training and test sets

When building models, an established practice is to use only a subset of the data to train the model, while the rest is used for testing/validation. Using random sampling support in Big R, let us split the data in “airfilt” into training set (~70%) and test set (~30%).

[code language=”r”]>splits <- bigr.sample(airfilt, c(0.7, 0.3))[/code]

__5.            Examine the splits

We now have two bigr.frame objects that represent the training and test sets.

[code language=”r”]>class(splits)                      # splits is a "list" …[/code]
[1] "list"
[code language=”r”]>length(splits)                     # … with two elements[/code]
[1] 2

Define two new variables for the two bigr.frames.

[code language=”r”]>train <- splits[[1]]

>test <- splits[[2]][/code]

Check the class of the objects

[code language=”r”]>class(train)[/code]
[1] "bigr.frame"

attr(,"package")

[1] "bigr"
[code language=”r”]>class(test)[/code]
[1] "bigr.frame"

attr(,"package")

[1] "bigr"

Check if we roughly got the right split percentages.

[code language=”r”]>nrow(train) / nrow(airfilt)         # Approximately 70%[/code]
[1] 0.6986776
[code language=”r”]>nrow(test) / nrow(airfilt)          # Approximately 30%[/code]
[1] 0.3013224
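The split percentages are only approximate because each row is assigned at random. A minimal local sketch of that idea follows, assuming each row is independently labelled with probabilities 0.7 and 0.3; how bigr.sample() draws its samples internally is not described in this lab, so treat this as an analogy rather than its implementation.

```r
set.seed(1)  # reproducible split

# Simulated rows standing in for the filtered airline data
flights <- data.frame(id = 1:1000, ArrDelay = rnorm(1000))

# Draw a label for every row; about 70% of rows receive "train"
label <- sample(c("train", "test"), size = nrow(flights),
                replace = TRUE, prob = c(0.7, 0.3))

train <- flights[label == "train", ]
test  <- flights[label == "test", ]

# The fractions land close to, but not exactly on, 0.7 and 0.3
print(c(train = nrow(train), test = nrow(test)) / nrow(flights))
```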

Model building using in-Hadoop execution

__6.            Decision tree model

Let us write an R function that builds a decision-tree model.

# Line 1: Define the function signature. The function takes one parameter, “df”, which represents the data that will be used to build the model.

# Line 2: The decision-tree algorithm comes to us from the open-source package “rpart”. This package has been previously installed on the machine. This package needs to be loaded from within the function.

# Lines 3-4: Define all the columns we’re interested in. This includes the response variable (“ArrDelay”) and the predictors.

# Line 5: Build the model. The expression “ArrDelay ~ .” is an R formula. It indicates to R that ArrDelay is the response variable, while every other column (.) is a predictor. The expression “df[,predcols]” projects only the needed columns.

# Line 6: The model, which is an object of class “rpart”, is returned by the function.

 

[code language=”r”] buildmodel <- function(df) {                           # line 1
  library(rpart)                                       # line 2
  predcols <- c('ArrDelay', 'DepDelay', 'DepTime',     # line 3
                'CRSArrTime', 'Distance')              # line 4
  model <- rpart(ArrDelay ~ ., df[,predcols])          # line 5
  return(model)                                        # line 6
}
[/code]

__7.            Partitioned execution using groupApply

Now, we will build several decision-tree models, one per airline. If you are familiar with R’s “apply” functions, including lapply() and tapply(), Big R’s groupApply() will look familiar to you. groupApply() needs three things: the data, the grouping columns, and the R function to call.

Line 1: We’re building models on the training set, i.e., “train”

Line 2: Since we’re interested in building one model per airline, we group by UniqueCarrier

Line 3: The R function we created above

Run the following lines. It will take around 10-15 seconds to build the models on a single-node cluster such as the one you’re running on.

 

[code language=”r”] models <- groupApply(data = train,                         # line 1
                     groupingColumns = train$UniqueCarrier, # line 2
                     rfunction = buildmodel)                # line 3
[/code]

groupApply ran for 11.54secs
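Conceptually, the group-and-apply pattern resembles split() followed by lapply() on a local data.frame: partition the rows by carrier, then call the model-building function once per partition. The sketch below runs that local analogue on simulated data (the `flights` values are synthetic, and `local_models` is a name introduced here); it does not show how groupApply() distributes work on the cluster.

```r
library(rpart)  # decision-tree package used by the tutorial

set.seed(7)  # reproducible toy data
n <- 200
flights <- data.frame(
  UniqueCarrier = rep(c("HA", "UA"), each = n / 2),
  DepDelay      = round(rnorm(n, mean = 5, sd = 20)),
  DepTime       = sample(500:2300, n, replace = TRUE),
  CRSArrTime    = sample(600:2359, n, replace = TRUE),
  Distance      = sample(100:3000, n, replace = TRUE),
  stringsAsFactors = FALSE
)
# Simulated response: arrival delay follows departure delay plus noise
flights$ArrDelay <- flights$DepDelay + round(rnorm(n, sd = 5))

# Same model-building logic as the tutorial's function
buildmodel <- function(df) {
  predcols <- c("ArrDelay", "DepDelay", "DepTime", "CRSArrTime", "Distance")
  rpart(ArrDelay ~ ., df[, predcols])
}

# split() partitions rows by carrier; lapply() fits one model per partition
local_models <- lapply(split(flights, flights$UniqueCarrier), buildmodel)

print(names(local_models))
```

The result is an ordinary named list, so `local_models$HA` retrieves the Hawaiian model in the same way the grouping value is used to reference an element of a bigr.list.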

__8.            Examining the models

groupApply() builds the models and stores them on BigInsights itself. We’re presented with a bigr.list object that provides us access to the models.

[code language=”r”]>class(models)[/code]
[1] "bigr.list"

attr(,"package")

[1] "bigr"

The bigr.list, ‘models’, has two elements in it. The “group” column indicates the value of the grouping column. The “status” column indicates whether or not the execution of the R function was successful on that group.

[code language=”r”]>print(models)[/code]
bigr.list

group1 status

1     HA     OK

2    UA     OK

__9.            Pulling models to the client

We can pull one or both models from BigInsights. This statement brings the model for Hawaiian Airlines (models$HA) from the server and loads it into the memory of your current R session. Note how the grouping column values can be used to reference elements of the bigr.list (models).

 

[code language=”r”]>modelHA <- bigr.pull(models$HA)[/code]

Examine the model we retrieved from the cluster

[code language=”r”]>class(modelHA)[/code]
[1] "rpart"

 

[code language=”r”]>print(modelHA)[/code]
n=192 (3 observations deleted due to missingness)

node), split, n, deviance, yval
      * denotes terminal node

 1) root 192 30159.9200 -1.7291670
   2) DepDelay< 16 185 15144.1800 -3.3189190
     4) DepDelay< -2.5 126 6080.3250 -6.3412700
       8) Distance>=2585 7 1792.8570 -18.1428600 *
       9) Distance< 2585 119 3255.1760 -5.6470590
        18) Distance< 1307 112 1424.4290 -6.1785710
          36) DepDelay< -6.5 31 476.1935 -8.8387100 *
          37) DepDelay>=-6.5 81 644.9136 -5.1604940 *
        19) Distance>=1307 7 1292.8570 2.8571430 *
     5) DepDelay>=-2.5 59 5454.9150 3.1355930
      10) Distance< 1307 49 1079.3470 0.8163265
        20) DepDelay< 0.5 28 275.0000 -1.5000000 *
        21) DepDelay>=0.5 21 453.8095 3.9047620 *
      11) Distance>=1307 10 2820.5000 14.5000000 *
   3) DepDelay>=16 7 2191.4290 40.2857100 *

Visualize the model. You may want to enlarge the RStudio plot window so the graph fits. Note that the model has “strong” references to DepDelay, and weak references to some of the other predictors. Some of the columns are not even included in the model. This suggests that flight arrival delay depends mostly on whether the flight was late taking off.

[code language=”r”] >source("~/labs-bigr/prettyTree.R")
prettyTree(modelHA)
[/code]

[Figure: the Hawaiian Airlines model drawn by prettyTree(modelHA)]

Model scoring

__10.        Making predictions - Write the scoring function

Now that our models have been created, we need to use them to make predictions. Let’s write another function that scores our models. In our case, scoring involves predicting arrival delay (ArrDelay) for flights.

Study the following function carefully:

Line 1: The function signature takes two parameters. ‘df’ is the data that we’re scoring. ‘models’ is a bigr.list that contains the models.

Line 2: Load the library rpart

Line 3: Extract the carrier code for this group (every row in ‘df’ belongs to the same carrier)

Line 4: Load the model from BigInsights and materialize it in R’s memory

Line 5: Do the actual prediction on each row in ‘df’ using ‘model’

Line 6: Return one row for each input flight. Each row has the following columns – Carrier, DepDelay, ArrDelay, Predicted Arrival Delay

[code language=”r”] >scoreModels <- function(df, models) {           # line 1
    library(rpart)                                  # line 2
    carrier <- df$UniqueCarrier[1]                  # line 3
    model <- bigr.pull(models[carrier])             # line 4
    prediction <- predict(model, df)                # line 5
    return(data.frame(carrier, df$DepDelay, df$ArrDelay, prediction))  # line 6
}
[/code]
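
To see the scoring logic in isolation, here is a minimal local sketch that needs no cluster. It is an illustration under assumptions, not the tutorial's code: the flights are synthetic, `make_flights`, `models_local` and `score_local` are hypothetical names, and an ordinary named list stands in for the bigr.list (so no bigr.pull() call is needed).

```r
library(rpart)

set.seed(1)
# Hypothetical generator for synthetic flights (all values are made up)
make_flights <- function(n) {
  df <- data.frame(
    UniqueCarrier = rep(c("HA", "UA"), length.out = n),
    DepDelay      = round(rnorm(n, mean = 5, sd = 20)),
    Distance      = round(runif(n, min = 100, max = 3000)),
    stringsAsFactors = FALSE
  )
  df$ArrDelay <- df$DepDelay + round(rnorm(n, sd = 5))
  df
}
train_local <- make_flights(200)
test_local  <- make_flights(60)

# One tree per carrier, kept in an ordinary named list
models_local <- lapply(split(train_local, train_local$UniqueCarrier),
                       function(df) rpart(ArrDelay ~ DepDelay + Distance, data = df))

# Local counterpart of the scoring function
score_local <- function(df, models) {
  carrier    <- df$UniqueCarrier[1]   # all rows in df share one carrier
  model      <- models[[carrier]]     # look up that carrier's model by name
  prediction <- predict(model, df)    # one prediction per input row
  data.frame(carrier, DepDelay = df$DepDelay, ArrDelay = df$ArrDelay,
             ArrDelayPred = prediction, stringsAsFactors = FALSE)
}

# Score each carrier's rows separately, then stack the results
preds_local <- do.call(rbind, lapply(split(test_local, test_local$UniqueCarrier),
                                     score_local, models = models_local))
nrow(preds_local) == nrow(test_local)
```

As in the Big R version, the function receives only the rows of one group, which is why reading the carrier from the first row is enough.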

 

__11.        Making predictions - Run the scoring

Since we’ve built one model per airline, it makes sense for us to partition the test set by airline as well. Therefore, we’ll use groupApply() again, and use the same grouping column as before (UniqueCarrier). As an added twist, we will subpartition each group into 2 batches. This batching allows Big R to split the input group into smaller chunks that can be easily managed by the R instances that are spawned on the server.

Add a new column to “test”. This column will have a value of either 0 or 1.

[code language=”r”]>test$batch <- as.integer(bigr.random() * 2)[/code]

Check that we have 4 groups, two for each airline. Also check that each batch has approximately the same # of rows.

[code language=”r”]>summary(count(.) ~ UniqueCarrier + batch, object = test)[/code]
UniqueCarrier batch count(.)

1           HA     0       36

2          UA     1     2082

3           UA     0     2105

4           HA     1       38
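
The batching trick can be tried locally with base R. The sketch below assumes that bigr.random() behaves like uniform draws between 0 and 1 (which is what the 0-or-1 result described above implies) and uses runif() in its place. The `test_local` data frame and its carrier counts are made up.

```r
set.seed(7)
# Synthetic stand-in for the test set (carrier counts are made up)
test_local <- data.frame(
  UniqueCarrier = rep(c("HA", "UA"), times = c(80, 400)),
  stringsAsFactors = FALSE
)

# runif() draws from the unit interval, so truncating runif() * 2 yields 0 or 1
test_local$batch <- as.integer(runif(nrow(test_local)) * 2)

# Row counts per (carrier, batch) pair, the local counterpart of count(.)
counts <- as.data.frame(table(UniqueCarrier = test_local$UniqueCarrier,
                              batch = test_local$batch))
counts

# Splitting on both columns gives one partition per pair
parts <- split(test_local, list(test_local$UniqueCarrier, test_local$batch))
length(parts)
```

Because the batch is assigned at random, the two batches of a carrier are only approximately equal in size, just like the counts shown above.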

Execute the groupApply function.

Line 1: The input data is the test set

Line 2: List of grouping columns, UniqueCarrier and batch

Line 3: The scoring function to invoke on each group

Lines 4-8: The “shape” of the output of the scoring function. Earlier, we defined the scoring function to return a data.frame with a specified number of columns and their corresponding types. That information needs to be provided to groupApply() as well.

Line 9: Parameter that will be passed on from groupApply() to “scoreModels”. In this case, it’s our models object.

This call will take 15-20 seconds. Internally, BigInsights spawns 4 R instances, one for each partition. Since we’re on a single-node cluster these instances work in a serial fashion.

 

[code language=”r”] >preds <- groupApply(test,                       # line 1
    list(test$UniqueCarrier, test$batch),               # line 2
    scoreModels,                                        # line 3
    signature=data.frame(carrier='Carrier',             # line 4
        DepDelay=1.0,
        ArrDelay=1.0,
        ArrDelayPred=1.0,
        stringsAsFactors=F),                            # line 8
    models)                                             # line 9
[/code]
groupApply ran for 20.04secs
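
The signature argument above is just a one-row data.frame whose example values carry the column names and types. The following sketch inspects such a template in plain R. The names `signature_template` and `matches_signature` are hypothetical, and the check compares types by position only, since the tutorial does not state how Big R itself matches columns.

```r
# Template data frame: one example value per output column.
# Only the column names and types matter, not the values themselves.
signature_template <- data.frame(carrier      = "Carrier",
                                 DepDelay     = 1.0,
                                 ArrDelay     = 1.0,
                                 ArrDelayPred = 1.0,
                                 stringsAsFactors = FALSE)

# Column types implied by the template
expected_types <- sapply(signature_template, class)
expected_types

# Hypothetical helper: does a result have the same column types, in order?
matches_signature <- function(result, template) {
  identical(unname(sapply(result, class)), unname(sapply(template, class)))
}

# One scored row, using values from the predictions shown in this tutorial
scored <- data.frame(carrier = "HA", DepDelay = -5, ArrDelay = -8,
                     ArrDelayPred = -5.160494, stringsAsFactors = FALSE)
matches_signature(scored, signature_template)
```

Setting stringsAsFactors to FALSE matters here: without it, older versions of R would turn the carrier column into a factor rather than a character column.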

The predictions are materialized on the cluster. As we’ve seen in other instances, Big R returns a bigr.frame object that holds the rows. Let’s examine the dimensions and contents of our predictions.

We should have the same # of rows in “preds” as we have in the “test” set.

[code language=”r”]>print(nrow(preds))[/code]
[1] 4261
[code language=”r”]>print(nrow(test))[/code]
[1] 4261

Examine 5 predictions from the top. See what the actual “ArrDelay” was, and what our models predicted (“ArrDelayPred”). Do our predictions sound reasonable? It’s hard to say with just 5 rows.

[code language=”r”]>head(preds, 5)[/code]
carrier DepDelay ArrDelay ArrDelayPred

1     HA       -5       -8   -5.160494

2     HA       -3       -8   -5.160494

3     HA       -9       -6     2.857143

4     HA       -5       -8   -5.160494

5     HA       -1       -6   14.500000

 

__12.        Check model quality

To assess our models accurately, we’ll rely on the frequently used metric called RMSD. RMSD (root mean squared deviation) is a measure of the differences between values predicted by a model and the values actually observed.

With Big R, we can easily compute RMSD using the following expression executed against the “preds” bigr.frame that resides in BigInsights.

 

[code language=”r”]>rmsd <- sqrt(sum((preds$ArrDelay - preds$ArrDelayPred) ^ 2) / nrow(preds))[/code]

What’s our RMSD?

[code language=”r”]>print(rmsd)[/code]
[1] 15.24068
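
The same formula can be checked by hand on ordinary R vectors. This is a small sketch, with `rmsd_local` as a hypothetical helper name. It first uses a case that is easy to verify mentally, then the five rows printed by head(preds, 5) earlier.

```r
# RMSD: square root of the mean squared difference between observed and predicted
rmsd_local <- function(actual, predicted) {
  sqrt(sum((actual - predicted)^2) / length(actual))
}

# Hand-checkable case: the errors are 0, 0 and -3, so RMSD = sqrt(9 / 3)
rmsd_local(c(1, 2, 3), c(1, 2, 6))

# The five rows shown by head(preds, 5) earlier in this tutorial
arr_delay      <- c(-8, -8, -6, -8, -6)
arr_delay_pred <- c(-5.160494, -5.160494, 2.857143, -5.160494, 14.5)
rmsd_local(arr_delay, arr_delay_pred)
```

Because the errors are squared before averaging, a single large miss (such as the 20.5-minute error in the fifth row) dominates the result.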

Lastly, let’s examine some rows where our model was the most wrong.

[code language=”r”] >preds$error <- abs(preds$ArrDelay - preds$ArrDelayPred)
>head(bigr.sort(preds, preds$error, decr=T))
[/code]
carrier DepDelay ArrDelay ArrDelayPred   error

1     UA     179     346   177.323232 168.6768

2     UA       -3     147   -3.355674 150.3557

3     UA       13     152     6.303480 145.6965

4     UA     139     313   177.323232 135.6768

5     UA       36     176   42.142553 133.8574

6     UA     457     457   333.266667 123.7333
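
For comparison, the same "largest errors first" view can be produced on a local data.frame with base R. The sketch below reuses the five predictions printed earlier. `preds_local` and `worst` are hypothetical names, and order() plays the role that bigr.sort() plays on the cluster.

```r
# The first five predictions shown earlier in this tutorial
preds_local <- data.frame(
  carrier      = "HA",
  DepDelay     = c(-5, -3, -9, -5, -1),
  ArrDelay     = c(-8, -8, -6, -8, -6),
  ArrDelayPred = c(-5.160494, -5.160494, 2.857143, -5.160494, 14.5),
  stringsAsFactors = FALSE
)

# Absolute error per row
preds_local$error <- abs(preds_local$ArrDelay - preds_local$ArrDelayPred)

# order() with decreasing = TRUE puts the largest errors first
worst <- preds_local[order(preds_local$error, decreasing = TRUE), ]
head(worst, 3)
```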

We get an RMSD of roughly 15 minutes, and our model is off by a lot on certain rows. We may be able to improve the model by using additional predictors such as departure and arrival cities (some airports are worse than others), day of the week (weekends have different traffic patterns than weekdays), etc. It is debatable whether our original data even has all of the predictors that affect arrival delay. For the moment though, we’re done building and testing our model.

__13.        Free experimentation

Use your knowledge of groupApply() to execute your own R functions on data in BigInsights. Be careful, though! Your machine is only a single-node cluster, so BigInsights will need to serialize the processing of each partition. This is precisely the reason why we chose to limit our model building to two airlines. On a more powerful cluster with many cores and nodes, Big R will automatically parallelize groupApply() to exploit the full cluster.


