A few months ago, I published Sentiment Analysis of Reddit AMAs, which explained how to grab a reddit Ask Me Anything (AMA) conversation and export its data for analysis using our Simple Data Pipe app. From there, I used the Spark-Cloudant Connector and Watson Tone Analyzer to get insights into writer sentiment.

Recently, I followed up with a similar exercise, but this time performing analysis using dashDB data warehouse and R.

I’m back at it again, to share my excitement over SparkR, an R API for Apache Spark. Analysis with SparkR lets you create a full working notebook fast and iterate with ease. In this tutorial, I connect to our Cloudant database using a handy new CouchDB R package, fetch all JSON documents, create a SparkR dataframe, analyze it with SQL and SparkR, then plot the results with R.


Before you begin

If you haven’t already, read my earlier Sentiment Analysis of Reddit AMAs blog post, so you understand what we’re up to here. You’ll get the background you need, and we can dive right into this alternate analysis approach. (You don’t need to follow that earlier tutorial, nor the follow-up on dashDB + R, in order to implement this SparkR solution. All the steps you need are here in this blog post.)

Deploy Simple Data Pipe

The fastest way to deploy this app to Bluemix (IBM’s Cloud platform) is to click the Deploy to Bluemix button, which automatically provisions and binds the Cloudant service too. (Bluemix offers a free trial, which means you can try this tutorial out for free.)


If you would rather deploy manually, or have any issues, refer to the readme.

When deployment is done, click the EDIT CODE button.

Install reddit Connector


Since we’re importing data from reddit, you need to establish a connection between reddit and Simple Data Pipe.

Note: If you have a local copy of Simple Data Pipe, you can install this connector using Cloud Foundry.

  1. In Bluemix, at the deployment succeeded screen, click the EDIT CODE button.
  2. Click the package.json file to open it.
  3. Edit the package.json file to add the following line to the dependencies list:
    "simple-data-pipe-connector-reddit": "^0.1.2"
    Tip: be sure to end the preceding line with a comma and follow proper JSON syntax (see the example after these steps).
  4. From the menu, choose File > Save.


  5. Press the Deploy app button and wait for the app to deploy again.

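For reference, here’s roughly how the dependencies section of package.json might look after the edit. The other package names and version numbers here are illustrative; your file will differ:

    "dependencies": {
      "cfenv": "^1.0.0",
      "express": "^4.13.0",
      "simple-data-pipe-connector-reddit": "^0.1.2"
    }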

Add Services in Bluemix

To work its magic, the reddit connector needs help from a couple of additional services. In Bluemix, we’re going to analyze our data using the Apache Spark and Watson Tone Analyzer services. Add them now by following these steps:

Provision IBM Analytics for Apache Spark Service

  1. On your Bluemix dashboard, click Work with Data.
  2. Click New Service.
  3. Find and click Apache Spark, then click Choose Apache Spark.
  4. Click Create.

Provision Watson Tone Analyzer Service

  1. In Bluemix, go to the top menu, and click Catalog.
  2. In the Search box, type Tone Analyzer, then click the Tone Analyzer tile.
  3. Under app, click the arrow and choose your new Simple Data Pipe application. Doing so binds the service to your new app.
  4. In Service name, enter only tone analyzer (delete any extra characters).
  5. Click Create.
  6. If you’re prompted to restage your app, do so by clicking Restage.

Load reddit Data

  1. Launch Simple Data Pipe in one of the following ways:
    • If you just restaged, click the URL for your Simple Data Pipe app.
    • Or, in Bluemix, go to the top menu and click Dashboard, then on your Simple Data Pipe app tile, click the Open URL button.
  2. In Simple Data Pipe, go to menu on the left and click Create a New Pipe.

  3. Click the Type dropdown list, and choose Reddit AMA.
    The Reddit AMA option appears here because you installed the reddit connector earlier.
  4. In Name, enter ibmama, or whatever you wish.

  5. If you want, enter a Description.

  6. Click Save and continue.

  7. Enter the URL for the reddit conversation you want to analyze. You’re not limited to using an AMA conversation here. You can enter the URL of any reddit conversation, including the IBM-hosted AMA we used in earlier tutorials: https://www.reddit.com/r/IAmA/comments/3ilzey/were_a_bunch_of_developers_from_ibm_ask_us

  8. Click Connect to AMA.
    You see a You’re connected confirmation message.

  9. Click Save and continue.

  10. On the Filter Data screen, make the following two choices:

    • under Comments to Load, select Top comments only.
    • under Output format, choose JSON flattened.

    Then click Save and continue.

    Why flattened JSON? Flat JSON format is much easier for Apache Spark to process, so for this tutorial, the flattened option is the best choice. If you decide to use the Simple Data Pipe to process reddit data with something other than Spark, you probably want to choose JSON to get the output in its purest form. (See the brief example after these steps.)

  11. Click Skip, to bypass scheduling.

  12. Click Run now.

    When the data’s done loading, you see a Pipe Run complete! message.

  13. Click View details.

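As promised above, here’s the difference between the two output formats, using a hypothetical reddit comment. The field names are illustrative, not the connector’s exact schema. Nested JSON preserves the original document structure:

    {
      "author": "some_redditor",
      "data": {
        "text": "Great AMA!",
        "tone": { "Anger": 12.5 }
      }
    }

Flattened JSON collapses the hierarchy into top-level fields, which map neatly onto dataframe columns:

    {
      "author": "some_redditor",
      "data_text": "Great AMA!",
      "data_tone_Anger": 12.5
    }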

Analyze reddit Data

Create new R Notebook

Create a notebook on IBM’s Data Science Experience (DSX):

  1. Sign in or create a trial account on DSX.
  2. Create a new project (or select an existing project).

    On the upper right of the screen, click the + plus sign and choose Create project.

  3. Add a new notebook (From URL) within the project.
    1. Click add notebooks.
    2. Click From URL.
    3. Enter any name, and under Notebook URL enter https://github.com/ibm-cds-labs/reddit-sentiment-analysis/raw/master/couchDB-R/Preview-R-couchDB.ipynb
    4. Click Create Notebook.
  4. Copy and enter your Cloudant credentials.

    In a new browser tab or window, open your Bluemix dashboard and click your Cloudant service to open it. From the menu on the left, click Service Credentials. If prompted, click Add Credentials. Copy your Cloudant host, username, and password into the corresponding places in cell 4 of the notebook (replacing the XXXX placeholders).
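For reference, Cloudant service credentials in Bluemix look roughly like this (every value below is a placeholder):

    {
      "username": "XXXX-bluemix",
      "password": "XXXX",
      "host": "XXXX-bluemix.cloudant.com",
      "port": 443,
      "url": "https://XXXX-bluemix:XXXX@XXXX-bluemix.cloudant.com"
    }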

Run the code and generate reports

  1. Install CouchDB R package

    Run cells 1 and 2 to install the CouchDB package and load its library. You need to run these only once. Read more about the package.

  2. Define a variable sqlContext by initializing a SparkR SQL context from the Spark context (sc), which the IBM Analytics for Apache Spark service has already initialized for you.

    In [3]: sqlContext <- sparkRSQL.init(sc)

  3. Run cell 4 to connect to Cloudant.

  4. Run cell 5 to get a list of Cloudant databases.

    In [5]: couch_list_databases(myconn)

    Out [5]:

        'pipe_db' 'reddit_sparkr_top_comments_only' 
  5. Then inspect the connection by running the next cell:

    In [6]: print(myconn)

Create a SparkR dataframe from a Cloudant database

There is no magic function that gets the desired documents into a ready-to-use SparkR dataframe. Instead, the function couch_fetch() retrieves a document by key, and at this point in the code I don’t have any keys in hand. Thanks to the primary index _all_docs that comes with every Cloudant database, there’s no need to write extra code: simply append a forward slash / and _all_docs to the database name. (To learn more, read https://cloudant.com/for-developers/all_docs/.)

  1. Use _all_docs to fetch all documents from the Cloudant database (mine is named reddit_regularreddit_top_comments_and_replies) and create a data frame by running the following command:

    In[7]:

        results <- couch_fetch(myconn, database = "reddit_regularreddit_top_comments_and_replies/_all_docs", key = NULL, myOpts = NULL)
        results_df <- data.frame(results)
        df <- createDataFrame(sqlContext, results_df)

    Note: Insert your database name in this cell. You can find it in the results from running couch_list_databases(myconn) two cells earlier.

    About SparkR and R Dataframes

    “SparkR is based on Spark’s parallel DataFrame abstraction. Users can create SparkR DataFrames from ‘local’ R data frames, or from any Spark data source such as Hive, HDFS, Parquet or JSON.”
    -Spark 1.4 Announcement

    You can create a SparkR dataframe from R data by calling either createDataFrame() or as.DataFrame(); both do the job. In this case, I used
    createDataFrame(sqlContext, data)
    where data is an R dataframe or a list, and it returns a DataFrame.

    Alternatively, you can download the contents of a SparkR DataFrame into an R data.frame by calling
    as.data.frame() (all lowercase). So,
    as.data.frame(x)
    where x is a DataFrame, returns a data.frame.

    Tip: Learn more about any R data type by calling
    typeof(x), where x is an R object such as a matrix,
    list, vector, or data.frame.

    For the detailed API, see http://spark.apache.org/docs/latest/api/R/index.html
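    Here’s a minimal sketch of that round trip, assuming the sqlContext created in cell 3 (the data values are made up):

        # R data.frame -> SparkR DataFrame
        local_df <- data.frame(author = c("a", "b"), score = c(71, 42))
        spark_df <- createDataFrame(sqlContext, local_df)

        # SparkR DataFrame -> local R data.frame
        back_df <- as.data.frame(spark_df)

        # typeof() reveals the underlying R type
        typeof(back_df)   # 'list' -- a data.frame is a list of columns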

  2. Print the schema that you just created by running the next cell:

    In [8]: printSchema(df)

    which returns:

         root
         |-- total_rows: integer (nullable = true)
         |-- offset: integer (nullable = true)
         |-- rows_id: string (nullable = true)
         |-- rows_key: string (nullable = true)
         |-- rows_rev: string (nullable = true)
         |-- rows_id_1: string (nullable = true)
         |-- rows_key_1: string (nullable = true)
         |-- rows_rev_1: string (nullable = true)
         |-- rows_id_2: string (nullable = true)
         |-- rows_key_2: string (nullable = true)
         |-- rows_rev_2: string (nullable = true)
         |-- rows_id_3: string (nullable = true)
         |-- rows_key_3: string (nullable = true)
         |-- rows_rev_3: string (nullable = true)

    The first row is the design document (its ID starts with _design/), so ignore it. All that follows is reddit data.

  3. Run the typeof(results) command, which returns 'list'.

  4. Print the list results returned by couch_fetch().

    In [10]: print(results)

        $total_rows
        [1] 4
    
        $offset
        [1] 0
    
        $rows
        $rows[[1]]
        $rows[[1]]$id
        [1] "_design/Top comments and replies"
    
        $rows[[1]]$key
        [1] "_design/Top comments and replies"
    
        $rows[[1]]$value
        $rows[[1]]$value$rev
        [1] "1-edc6f6bb0062260ecf1160c81872efdd"
    
        $rows[[2]]
        $rows[[2]]$id
        [1] "f4f7cfa487898608fff6eb639fe6ed26"
    
        $rows[[2]]$key
        [1] "f4f7cfa487898608fff6eb639fe6ed26"
    
        $rows[[2]]$value
        $rows[[2]]$value$rev
        [1] "1-c0be345c89577577cdeb301328d9e4f5"
        .....
  5. Next, iterate over the list of keys returned, fetch each document individually, add it as a row to an R dataframe, and create a new SparkR dataframe from the result.

    In [11]:

        keys_list <- data.frame(results)
        rows_df_2 <- data.frame()
        # The design document's key landed in column rows.key; the loop starts
        # at rows.key.1, so the design document is skipped
        for (i in 1:(keys_list[,'total_rows'] - 1)){
            key <- paste('rows.key.', i, sep="")
            docs <- couch_fetch(myconn, database = "reddit_regularreddit_top_comments_and_replies", key = keys_list[,key], myOpts = NULL)
            rows_df <- data.frame(docs)
            rows_df_2 <- rbind(rows_df_2, rows_df)
        }

        df2 <- createDataFrame(sqlContext, rows_df_2)
        printSchema(df2)
        showDF(df2)
    

The output shows the schema and the contents of the new SparkR dataframe df2.

Analyze reddit data with SparkR SQL and Plot with R

Now we’ll create a bar chart showing comment count by sentiment (for comments scoring higher than 70%). The Tone Analyzer scores live in columns 9 through 21 of df2, one column per sentiment, so that’s the range the code below iterates over.

In [12]:

    registerTempTable(df2,"reddit")
    sentimentDistribution <- list()
    columns <- colnames(df2[,9:21])
    count <- list()

    # Count the comments where each sentiment column scores higher than 70
    for(i in 1:length(columns)){
        query <- paste('SELECT COUNT(*) AS sentCount FROM reddit WHERE ', columns[i], ' > 70')
        df3 <- sql(sqlContext, query)
        collected <- collect(df3)
        count[columns[i]] <- collected$sentCount
    }

    # Plot the counts as a bar chart
    distribution <- unlist(count)
    barplot(distribution, main="Sentiments > 70% in IBM Reddit AMA", col=139, ylim=c(0,130), cex.axis=0.5, cex.names=0.5, ylab="Reddit comment count")


Filter reddit data with SparkR and Print report to notebook

You don’t need to run SQL queries to work with Spark Dataframes. Dataframes have functions for filtering, selecting, grouping, and aggregation. filter() returns the rows, and select() the columns, that meet the condition passed in. In the following code, filter() returns a Dataframe containing the rows with an emotion score higher than 70%, and select() returns author (the redditor, or reddit user) and text (the redditor’s comments) from that Dataframe. We could use the SparkR Dataframe functions head() and showDF() to show a quick data overview, but since we want a full list of comments by sentiment with that high emotional score (>70%), we call the R print() function.

In [13]:

    for(i in 1:length(columns)){
        # Keep only the rows where this sentiment column scores higher than 70
        columnset <- filter(df2, paste(columns[i], ' > 70') )
        if(count(columnset) > 0){
            print('----------------------------------------------------------------')
            print(columns[i])
            print('----------------------------------------------------------------')
            comments <- as.data.frame(select(columnset, "author", "text"))
            for(j in 1:count(columnset)){
                print(paste("Author: ", comments[j,1]))
                print(paste("Comments: ", comments[j,2]))
            }
        }
    }

Results show comments grouped by sentiment. Some comments appear under multiple sentiment categories; for example, the question “Are you ashamed of Lotus Notes?” appears under both Disgust and Extraversion. You can scroll through the list.


Try loading a different reddit conversation

Launch your Simple Data Pipe app again and return to the Load reddit Data section. In step 7, swap in a different URL, run the notebook again, and check out the results.

Conclusion

If you’re an R fan, you’ll appreciate that SparkR provides a handy R frontend for Spark. What’s great about moving data from a JSON document to a SparkR or R dataframe is that the data structure stays pretty much the same. That flexibility lets you write SparkR notebooks fast and makes it easy to move data in and out of SparkR and R dataframes. Both offer powerful operations that produce informative analytics with high performance. R and JSON: made for each other.
