A few months ago, I published Sentiment Analysis of Reddit AMAs which explained how to grab a reddit Ask Me Anything (AMA) conversation and export its data for analysis using our Simple Data Pipe app. From there, I used the Cloudant-Spark Connector, and Watson Tone Analyzer to get insights into writer sentiment. The flow looked like this:

architectural diagram

But you have lots of options for the analysis portion of this exercise. How about a data warehouse instead? When you run Simple Data Pipe, the reddit AMA lands in Cloudant. From there, it’s a breeze to send data on to a dashDB data warehouse, where you can run analytics.

In this tutorial, I’ll perform analysis with R. Lots of people love R because it’s FREE and features lots of great statistics libraries that make it easy to analyze and visualize your data. It also comes built-in to dashDB. Handy!

Here’s the new alternate flow:

reddit AMA analytics in dashDB+R

Before you begin

If you haven’t already, read my earlier Sentiment Analysis of Reddit AMAs blog post, so you understand what we’re up to here. You’ll get the background you need, and we can dive right in to this alternate analysis approach. (You don’t need to follow that earlier tutorial to implement this dashDB + R solution. All the steps you need are here in this blog post.)

Deploy Simple Data Pipe

The fastest way to deploy this app to Bluemix is to click the Deploy to Bluemix button, which automatically provisions and binds the Cloudant service too.

Deploy to Bluemix

If you would rather deploy manually, or have any issues, refer to the readme.

When deployment is done, click the EDIT CODE button.

Install reddit Connector

Reddit logo

Since we’re importing data from reddit, you need to establish a connection between reddit and Simple Data Pipe.

Note: If you have a local copy of Simple Data Pipe, you can install this connector using Cloud Foundry.

  1. In Bluemix, at the deployment succeeded screen, click the EDIT CODE button.
  2. Click the package.json file to open it.
  3. Edit the package.json file to add the following line to the dependencies list:
    "simple-data-pipe-connector-reddit": "^0.1.2"
    Tip: be sure to end the line above with a comma and follow proper JSON syntax.
  4. From the menu, choose File > Save.

    save

  5. Press the Deploy app button and wait for the app to deploy again.

    deploy button

Add Watson Tone Analyzer

To work its magic, the reddit connector needs help from Watson Tone Analyzer. So add this service now by following these steps:

  1. In Bluemix, go to the top menu, and click Catalog.
  2. In the Search box, type Tone Analyzer, then click the Tone Analyzer tile.
  3. Under app, click the arrow and choose your new Simple Data Pipe application. Doing so binds the service to your new app.
  4. In Service name enter only tone analyzer (delete any extra characters)
  5. Click Create.
  6. If you’re prompted to restage your app, do so by clicking Restage.

Load the reddit AMA Data

  1. Launch simple data pipe in one of the following ways:
    • If you just restaged, click the URL for your simple data pipe app.
      open url button
    • Or, in Bluemix, go to the top menu and click Dashboard, then on your Simple Data Pipe app tile, click the Open URL button.
      open url button
  2. In Simple Data Pipe, go to menu on the left and click Create a New Pipe.

  3. Click the Type dropdown list, and choose Reddit AMA.
    Type dropdown
    When you added a reddit connector earlier, you added the Reddit option you’re choosing now.
  4. In Name, enter ibmama.

  5. If you want, enter a Description.

  6. Click Save and continue.

  7. Enter the URL for the AMA. We’ll use the sample IBM-hosted AMA explored at length my earlier post: https://www.reddit.com/r/IAmA/comments/3ilzey/were_a_bunch_of_developers_from_ibm_ask_us

  8. Click Connect to AMA.
    You see a You’re connected confirmation message.

  9. Click Save and continue.

  10. On the Filter Data screen, make the following 2 choices:
    • under Comments to Load, select Top comments only.
    • under Output format, choose JSON flattened.

    Then click Save and continue.

    Why flattened JSON? We plan to run some simple SQL queries on this data once we get it into dashDB, so flattened is the way to go, since each JSON document maps to a single row in one table.

  11. Click Skip, to bypass scheduling.

  12. Click Run now.

    When the data’s done loading, you see a Pipe Run complete! message.

  13. Click View details.

    Simple Data Pipe Activity Run now

Move the AMA data to dashDB

  1. On your Simple Data Pipe’s Activity page, Click the run’s Details link.

    detailslink

  2. Click the Top comments only link.
    topcommentslink
  3. Find your Cloudant password.
  4. In another browser tab, return to your Bluemix dashboard. Click your Cloudant service to open it. In the menu on the left, click Service Credentials and copy your password.
    Cloudant creds

  5. Return the broswer tab where you had Cloudant open. In the Password field, paste your Cloudant service password and click Sign in.
  6. From the Cloudant menu, choose Warehousing > Create a dashDB Warehouse.

  7. Enter your Bluemix credentials and click Authenticate in Bluemix.
  8. In Warehouse Name enter a name for your warehouse (doesn’t matter what you name it).
  9. In the Data Sources field, type an r and you see a choice appear called reddit_ibmama_top_comments_only. Select it and click Create Warehouse.
    When the system finishes provisioning your dashDB instance, you see your database appear.

  10. Click the Open in dashDB button.
  11. Review the table definition produced by the Simple Data Pipe.

    In the menu on the left side of the dashDB screen, click Tables. From the dropdown on the upper right, choose REDDIT_IBMAMA_TOP_COMMENTS_ONLY. You’ll see an emotions list that forms a base for a simple SQL query that we will run next.

Analyze AMA Data

dashDB offers a few different ways to work with data ranging from a simple SQL query interface to built-in R scripting and R Studio features.

Try a SQL Query

  1. On the left side of the dashDB screen, click the Run SQL tab.

  2. Delete the sample queries you see in the box and replace it with the following query (if your table name is not REDDIT_IBMAMA_TOP_COMMENTS_ONLY, replace that string with your table name):
     SELECT AUTHOR,TEXT FROM REDDIT_IBMAMA_TOP_COMMENTS_ONLY WHERE ANALYTICAL > 70
  3. Click the Run button.

    When it’s finished, the bottom of the screen shows a Succeeded message.

  4. Scroll down and you see a list of authors and their text entries from reddit. Sweet!

    query results: authors

Write an R script

dashDB offers you 2 ways to work with R:

  • dashDB web console. You can write and save queries directly in the dashDB interface.
  • R Studio. dashDB includes an integrated R studio environment, which runs directly on the Bluemix platform.

Both methods let you save and store scripts. Let’s start with a simple R query:

  1. From the menu, choose Analytics > R Scripts.
  2. Click the + plus sign to get a quick helper dialog box, which lets you pick data:

    r scripting table select

    1. Choose your ibmama table, then columns appear.
    2. Above the Columns list, click Clear all then turn on the Analytical checkbox and click the Apply button.

    You see the R query that dashDB writes.

  3. Add the following line to the query:

    df1465936166363 # prints dataframe contents to the console

    rquery_line1

  4. Click Submit.

  5. When the query finishes, dashDB lands you on the Console Output tab. Scroll down to see the entire output, which looks like this:

       Console Output:
        ANALYTICAL
    1         0.00
    2        32.40
    3         0.00
    .....
    395       0.00
    396       0.00
    Console Messages:
    Loading required package: RODBC
    Loading required package: ibmdbR
    Loading required package: methods
    Loading required package: MASS
    Loading required package: grDevices
    Loading required package: graphics
    Loading required package: stats
    Loading required package: utils
    Loading required package: Matrix
    Loading required package: arules
    
    Attaching package: ‘arules’
    
    The following objects are masked from ‘package:base’:
    
        %in%, write
    
    Loading required package: rpart
    Loading required package: rpart.plot
    Loading required package: ggplot2
    Warning message:
    closing unused RODBC handle 1
    

Analyze in RStudio

For a more robust analysis environment, try running scripts in RStudio.

  1. Get dashDB credentials.

    To access RStudio, you need the credentials for your dashDB service. To get them: Leave the R scripts page open (you’ll return here in a minute), and in a new tab or window, visit your Bluemix dashboard. Click your dashDB service to open it. On the left side of the screen click Service Credentials. If necessary, click Add Credentials. Copy values for username and password.

  2. Return to dashDB’s R Scripts page, click the R Studio button.
    open R Studio

  3. Paste in the Username and password you just copied from dashDB credentials and click Sign In.

    RStudio opens.

    RStudio

  4. Generate a bar chart of comments by sentiment.

    Create a new R script by clicking + > R Script.

    new R script

    Then enter the following script, select all the text you just pasted (otherwise you’d have to run the script line-by-line), and click Run.

    library(ibmdbR)
    mycon <- idaConnect("BLUDB", "", "")
    idaInit(mycon)
    
     emotions <- c('AGREEABLENESS', 'ANALYTICAL', 'ANGER', 'CONFIDENT', 'CONSCIENTIOUSNESS', 'DISGUST', 'EMOTIONAL_RANGE', 'EXTRAVERSION', 'FEAR', 'JOY', 'OPENNESS', 'SADNESS', 'TENTATIVE')
    displaynames <- c('AGREEABLE', 'ANALYTICAL', 'ANGER', 'CONFIDENT', 'CONSCIENTIOUS', 'DISGUST', 'EMO RANGE', 'XVERSION', 'FEAR', 'JOY', 'OPEN', 'SAD', 'TENTATIVE')
    
    count <- list()
    
    for(i in 1:13){
      query  70')
    
    df <- idaQuery(query,as.is=F)
    nrow(df) # print number of rows
    ncol(df) # print number of columns
    df # print dataframe
    count[[i]] <- df[1,]
    }
    
    barplot(distribution,main="Histogram of comments by sentiments > 70% in IBM Reddit AMA",names.arg=displaynames, col=139, cex.axis=0.8,cex.names=0.7,xlab="Tone",ylab="Count")

    In the pane on the lower right of the screen, click the Plots tab to see the following bar chart:

    Reddit AMA Histogram

  5. Create an web page that lists comments by sentiment.

    Create a new R script by clicking + > R Script,

    new R script

    Then enter the following script, select all the text, and click Run.

    
    library(ibmdbR)
    mycon <- idaConnect("BLUDB", "", "")
    idaInit(mycon)
    
    library(xtable) # loads xtable package
    
    emotions <- c('AGREEABLENESS', 'ANALYTICAL', 'ANGER', 'CONFIDENT', 'CONSCIENTIOUSNESS', 'DISGUST', 'EMOTIONAL_RANGE', 'EXTRAVERSION', 'FEAR', 'JOY', 'OPENNESS', 'SADNESS', 'TENTATIVE')
    
    count <- list()
    
    ##
    ## Count number of times an emotion occurs with score more than 70%
    ##
    
    for(i in 1:13){
      query  70')
    
      df <- idaQuery(query,as.is=F)
      nrow(df) # print number of rows
      ncol(df) # print number of columns
      df       # print dataframe
      count[[i]] <- df[1,]
    }
    
    ##
    ## Create new matrix and provide matrix dimensions nrow x ncol to the data.frame() function. Populate matrix with Author
    ## and comments with sentiments that have emotional score more than 70%. Create a new xtable object and print it into an
    ## html file.
    ##
    
    for(i in 1:13){
      query  70')
    
      df <- idaQuery(query,as.is=F)
      nrow(df)
      ncol(df)
      comments <- data.frame(matrix(NA, nrow=count[[i]], ncol=2))
      comments <- data.frame(df)
      colnames(comments) <- c('AUTHOR', emotions[i])
      comments.table <- xtable(comments)
      print.xtable(comments.table, type="html", file = "RedditSentiment.html",append=TRUE)
    
    }
    
    idaClose(mycon)
    

    On the upper right of the screen, in the Environment tab, click Comments. RStudio shows a list of comments in the left pane:

    comments list

    To see the html page, go to the pane on the lower right of the RStudio screen and click the Files tab. Then click RedditSentiment.html.

    link to html page

    To see the sample html page I generated from this AMA conversation, copy and paste this URL into your browser’s address bar:

    http://htmlpreview.github.io/?https://github.com/ibm-cds-labs/reddit-sentiment-analysis/blob/master/R-artifacts/RedditSentiment.html

Tip: You can download and run the R script code from our CDS labs github repository https://github.com/ibm-cds-labs/reddit-sentiment-analysis/tree/master/dashdb-R.

Conclusion

This is just one of many analysis solutions you could apply to Simple Data Pipe output. In my previous post, I showed analysis with Spark and Cloudant. Now you see how easy it is to move Cloudant JSON data on to a dashDB warehouse, then use built-in analysis tool R to glean meaning and insights. Interested in SparkR? Read my follow-up, which shows how to connect to Cloudant via CouchDB then analyze the JSON with SparkR.

Try these AMAs

Launch your Simple Data Pipe app again and return to the Load reddit AMA Data section. In step 7, swap in one of these AMA URLs and check out the results.

  • Steve Wozniak
    https://www.reddit.com/r/IAmA/comments/4apj5f/im_apple_cofounder_steve_wozniak_ask_me_anything/
  • Chris Rock
    https://www.reddit.com/r/IAmA/comments/2pi16o/chris_rock_here_ama/
  • Tim Berners Lee
    https://www.reddit.com/r/IAmA/comments/2091d4/i_am_tim_bernerslee_i_invented_the_www_25_years/
  • Neil deGrasse Tyson
    https://www.reddit.com/r/IAmA/comments/qccer/i_am_neil_degrasse_tyson_ask_me_anything/
  • Bill Gates
    https://www.reddit.com/r/IAmA/comments/18bhme/im_bill_gates_cochair_of_the_bill_melinda_gates/
  • Louis C. K.
    https://www.reddit.com/r/IAmA/comments/n9tef/hi_im_louis_ck_and_this_is_a_thing/
  • Amy Poehler
    https://www.reddit.com/r/IAmA/comments/2kp7w0/im_amy_poehler_amaa/
  • IBM’s Chef Watson
    https://www.reddit.com/r/IAmA/comments/3id842/we_are_the_ibm_chef_watson_team_along_with_our/
  • Barack Obama
    https://www.reddit.com/comments/z1c9z/i_am_barack_obama_president_of_the_united_states/

Join The Discussion

Your email address will not be published. Required fields are marked *