Reddit recently announced a coffee-table book, Ask Me Anything: Volume One. It’s a collection of their favorite Ask Me Anything (AMA) web events, in which anyone can get online with luminaries like Bill Gates, Madonna, Chris Rock, Elon Musk, or President Obama and ask them any question that comes to mind.
While this is a gorgeous book, it’s missing one key element that makes AMAs so valuable and rich: actionable data. When an AMA is online, you can access and analyze the text to glean insights from the discussion. The possibilities for interesting analyses are endless. For instance, check out this interactive graph that measures how people use language on reddit. Search for a term to see trends.
The book organizes AMAs in categories like Inspiring, Informative, Provocative, Fascinating, Beautiful, Courageous, Humorous, and Ingenious. Which category would you land in? We wondered the same thing about ourselves. In the spirit of eating our own dogfood (in every sense), we’ll explore this question using an AMA hosted by IBM developers and our home-grown analysis tools. Watson Tone Analyzer helps you understand how you’re coming across to others, so it’s perfect for this job.
Here’s how we built our own reddit AMA sentiment analysis solution (and you can too). In this tutorial, we:
- Take an IBM-hosted AMA.
- Load its data with our handy Simple Data Pipe, which leverages Bluemix (IBM’s Cloud platform service) and runs Node.js to move JSON data from reddit (or another source), enriches the data with Watson Tone Analyzer, and lands results in Cloudant.
- Run commands in an iPython notebook that use Apache Spark to analyze the Watson Tone Analyzer-enriched JSON in Cloudant, gauging positive or negative emotions across multiple tone dimensions, like anger, joy, openness, and more.
Deploy Simple Data Pipe
The fastest way to deploy this app to Bluemix is to click the Deploy to Bluemix button, which automatically provisions and binds the Cloudant service too.
If you would rather deploy manually, or have any issues, refer to the readme.
When deployment is done, click the EDIT CODE button.
Install reddit Connector
Since we’re importing data from reddit, you need to establish a connection between reddit and Simple Data Pipe.
Note: If you have a local copy of Simple Data Pipe, you can install this connector using Cloud Foundry.
- In Bluemix, at the deployment succeeded screen, click the EDIT CODE button.
- Click the package.json file to open it.
- Edit the package.json file to add a line for the reddit connector to its dependencies.
Tip: be sure to separate the added line from its neighbors with a comma and follow proper JSON syntax.
- From the menu, choose File > Save.
- Press the Deploy app button and wait for the app to deploy again.
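A missing comma in package.json is an easy mistake to make when adding a dependency by hand. As a minimal sketch of how you might check your edit locally before redeploying, the Python below parses a manifest and reports whether it is valid JSON. The manifest text, package names, and versions are placeholders invented for illustration; they are not the actual entries this tutorial uses.

```python
import json

# Hypothetical package.json text. The package names and versions are
# placeholders for illustration, not the tutorial's actual entries.
VALID_MANIFEST = """
{
  "name": "example-pipe-app",
  "dependencies": {
    "example-web-framework": "^1.0.0",
    "example-reddit-connector": "^1.0.0"
  }
}
"""

# Same text with the comma between the two dependency lines removed.
BROKEN_MANIFEST = VALID_MANIFEST.replace('"^1.0.0",', '"^1.0.0"')


def read_dependencies(manifest_text):
    """Parse package.json text and return its dependencies mapping.

    Raises ValueError when the text is not valid JSON.
    """
    manifest = json.loads(manifest_text)
    return manifest.get("dependencies", {})


dependencies = read_dependencies(VALID_MANIFEST)
print(sorted(dependencies))

try:
    read_dependencies(BROKEN_MANIFEST)
    missing_comma_detected = False
except ValueError as err:
    # json.loads raises a ValueError subclass on malformed input
    missing_comma_detected = True
    print("Invalid JSON:", err)
```

If the second call reports an error for your own file, look first for a missing or trailing comma around the line you added.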
Add Services in Bluemix
To work its magic, the reddit connector needs help from a couple of additional services. In Bluemix, we’re going to analyze our data using the Apache Spark and Watson Tone Analyzer services. So add them now by following these steps:
Provision IBM Analytics for Apache Spark Service
- Log in to Bluemix (or sign up for a free trial).
- On your Bluemix dashboard, click Work with Data.
- Click New Service.
- Find and click Apache Spark then click Choose Apache Spark
- Click Create.
Provision Watson Tone Analyzer Service
- In Bluemix, go to the top menu, and click Catalog.
- In the Search box, type Tone Analyzer, then click the Tone Analyzer tile.
- Under app, click the arrow and choose your new Simple Data Pipe application. Doing so binds the service to your new app.
- In Service name enter only tone analyzer (delete any extra characters)
- Click Create.
- If you’re prompted to restage your app, do so by clicking Restage.
Load the reddit AMA Data
- Launch Simple Data Pipe in one of the following ways:
- If you just restaged, click the URL for your Simple Data Pipe app.
- Or, in Bluemix, go to the top menu and click Dashboard, then on your Simple Data Pipe app tile, click the Open URL button.
- In Simple Data Pipe, go to the menu on the left and click Create a New Pipe.
- Click the Type dropdown list, and choose Reddit AMA.
- In Name, enter ibmama.
- If you want, enter a Description.
- Click Save and continue.
- Enter the URL for the AMA. We’ll use the sample IBM-hosted AMA we mentioned earlier:
- Click Connect to AMA.
You see a You’re connected confirmation message.
Click Save and continue.
On the Filter Data screen, make the following 2 choices:
- Under Comments to Load, select Top comments only.
- Under Output format, choose JSON flattened.
Then click Save and continue.
Why flattened JSON? Flat JSON format is much easier for Apache Spark to process, so for this tutorial, the flattened option is the best choice. If you decide to use the Simple Data Pipe to process reddit data with something other than Spark, you probably want to choose JSON to get the output in its purest form.
- Click Skip, to bypass scheduling.
Click Run now.
When the data’s done loading, you see a Pipe Run complete! message.
Click View details.
Tip: You can review the processed reddit comments in Cloudant along with the enriched Tone Analyzer metadata by clicking the run’s Details link and then clicking the Top comments only link. If prompted, enter your Cloudant password.
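To make the flattened JSON choice on the Filter Data screen more concrete, here is a minimal sketch of what flattening a nested record means. The record layout, field names, and the underscore separator are assumptions made for illustration; this post does not show the exact structure that Simple Data Pipe writes to Cloudant.

```python
def flatten(record, parent_key="", sep="_"):
    """Collapse nested dicts into one level, joining key names with sep."""
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects and merge their flattened keys
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical comment record; the field names are illustrative only.
nested_comment = {
    "author": "example_user",
    "text": "An example question",
    "tones": {
        "emotion": {"Joy": 12.5},
        "language": {"Analytical": 81.0},
    },
}

flat_comment = flatten(nested_comment)
for name in sorted(flat_comment):
    print(name, "=", flat_comment[name])
```

Every value in the flattened record sits at the top level, so each one can map directly to a column in a table, which is the shape Spark dataframes and SQL queries work with most easily.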
Analyze AMA Data
Create new Python Notebook
Create a notebook on IBM’s Data Science Experience (DSX):
- Sign in or create a trial account on DSX.
- Create a new project (or select an existing project).
On the upper right of the screen, click the + plus sign and choose Create project.
- Add a new notebook (From URL) within the project.
- Click add notebooks.
- Click From URL.
- Enter notebook name.
- Enter notebook URL:
- Select your Spark Service.
- Click Create Notebook.
- Copy and enter your Cloudant credentials.
In a new browser tab or window, open your Bluemix dashboard and click your Cloudant service to open it. From the menu on the left, click Service Credentials. If prompted, click Add Credentials. Copy your Cloudant credentials, including the password, into the corresponding places in cell 3 of the notebook (replacing the XXXX’s).
- Still in cell 3, at the end of the line, specify which Cloudant database to load by making sure the following string includes the name of the pipe you just created,
Edit this string to include the name you gave your pipe in the preceding section. The naming convention here is
- Leave this notebook open. We’ll run this code in a minute.
If prompted, select a kernel for the notebook. The notebook should successfully import.
When you use a notebook in DSX, you can run a cell only by selecting it, then clicking the Run Cell (▸ icon) button. If you don’t see the Run Cell button and Jupyter toolbar, go to the toolbar and click Edit.
About the Spark-Cloudant Connector
Before we run commands in the notebook, let’s peek under the hood. We use the Spark-Cloudant Connector, which lets you connect your Apache Spark instance to a Cloudant NoSQL DB instance and analyze the data. This is a great way to leverage Spark’s lightning-fast processing power directly on your Cloudant JSON data.
Run the Code and Generate Reports
New to notebooks? If you’ve never used a Python notebook before, here’s how you run commands. You must run cells in order from top to bottom. To run a cell, click it (a box appears around it) and in the menu above the notebook, click the Run button. While the command processes, an * asterisk appears (for a moment or a few minutes) in place of the number. When the asterisk disappears, and the number returns, processing is done, and you may move on to the next cell.
Now you can run the code in each notebook cell. Here’s what you’re doing as you run each command:
- Run cells 1 and 2 to connect to a SparkContext.
A SparkContext is the connection to a Spark cluster. It’s how you create RDDs and other items on that cluster.
Connect to your Cloudant database.
Run cell 3 (which you just customized, adding your database credentials) to connect to Cloudant, where the AMA data resides.
Create the dataframe and get it in tabular format. In cell 4, run df.printSchema(), then run cell 5.
Prep the dataframes for SQL commands by running cell 6.
Now start analyzing this data.
Watson Tone Analyzer captures tones in the text, gauging:
- emotions like Joy, Disgust, Anger, Fear, and Sadness
- social traits like Agreeableness, Openness, Conscientiousness, Extraversion, and Emotional Range
- language styles like Analytical, Tentative, and Confident
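The three groups above add up to 13 tone dimensions, which is where the number 13 in the notebook code comes from. As a small sketch, the lookup below records the groups in a Python dictionary. The group keys (emotion, social, language) are labels chosen here for illustration, not names taken from the Tone Analyzer API.

```python
# Tone names as listed in this post; the dictionary keys are
# illustrative labels, not official API identifiers.
TONE_CATEGORIES = {
    "emotion": ["Joy", "Disgust", "Anger", "Fear", "Sadness"],
    "social": ["Agreeableness", "Openness", "Conscientiousness",
               "Extraversion", "Emotional Range"],
    "language": ["Analytical", "Tentative", "Confident"],
}


def category_of(tone):
    """Return the group a tone belongs to, or None if it is unknown."""
    for category, tones in TONE_CATEGORIES.items():
        if tone in tones:
            return category
    return None


total_tones = sum(len(tones) for tones in TONE_CATEGORIES.values())
print(total_tones)
print(category_of("Emotional Range"))
```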
First, run the following code to compute the distribution of comments by sentiment scores greater than 70%.
sentimentDistribution = [0] * 13
for i, sentiment in enumerate(df.columns[-23:13]):
    # Count the comments whose score for this tone is above 70
    sentimentDistribution[i] = sqlContext.sql(
        "SELECT count(*) as sentCount FROM reddit where cast(" + sentiment + " as String) > 70.0"
    ).collect()[0].sentCount
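If you want to see the counting logic without a Spark cluster, the following plain-Python sketch does the same thing on a tiny in-memory list: for each tone, it counts the comments that score strictly above the threshold. The sample records, authors, and scores are invented for illustration and do not come from the IBM AMA.

```python
THRESHOLD = 70.0

# Hypothetical flattened comment records with made-up scores.
sample_comments = [
    {"author": "user_a", "Analytical": 85.0, "Agreeableness": 40.0, "Joy": 10.0},
    {"author": "user_b", "Analytical": 72.5, "Agreeableness": 91.0, "Joy": 5.0},
    {"author": "user_c", "Analytical": 30.0, "Agreeableness": 75.0, "Joy": 70.0},
]
sample_tones = ["Analytical", "Agreeableness", "Joy"]


def count_above_threshold(rows, tone_names, threshold=THRESHOLD):
    """Return one count per tone: rows scoring strictly above threshold."""
    counts = [0] * len(tone_names)
    for i, tone in enumerate(tone_names):
        counts[i] = sum(1 for row in rows if float(row[tone]) > threshold)
    return counts


sample_distribution = count_above_threshold(sample_comments, sample_tones)
print(dict(zip(sample_tones, sample_distribution)))
```

Note that the comparison is strict, so a score of exactly 70 does not count. In this sample, that is why Joy ends up with zero comments.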
- With the data stored in sentimentDistribution array, run the following code that plots the data as a bar chart.
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt

ind = np.arange(13)
width = 0.35
bar = plt.bar(ind, sentimentDistribution, width, color='g', label="distributions")

# Enlarge the default figure so all 13 tone labels fit
params = plt.gcf()
plSize = params.get_size_inches()
params.set_size_inches((plSize[0]*3.5, plSize[1]*2))
plt.ylabel('Reddit comment count')
plt.xlabel('Emotion Tone')
plt.title('Histogram of comments by sentiments > 70% in IBM Reddit AMA')
plt.xticks(ind+width, df.columns[-23:13])
plt.legend()
plt.show()
- In the last cell, run the following code to group by tone values:
comments = []
for i, sentiment in enumerate(df.columns[-23:13]):
    # Keep only the comments that scored above 70 for this tone
    commentset = df.filter("cast(" + sentiment + " as String) > 70.0")
    comments.append(commentset.rdd.map(lambda p: p.author + "\n\n" + p.text).collect())
    print("\n--------------------------------------------------------------------------------------------")
    print(sentiment)
    print("--------------------------------------------------------------------------------------------\n")
    for comment in comments[i]:
        print("[-] " + comment + "\n")
Scroll through the resulting list. You’ll see comments grouped by tone. Remember that these are comments that scored greater than 70% for each value.
Some comments appear under multiple headings, because they scored high for more than one. For example, the following comment appears under the language style Analytical and also under the social trait Emotional Range (sensitivity to environment, moodiness).
How do you keep convincing people to pay for Lotus notes as an email solution?
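A comment can land under more than one heading because each tone is filtered independently of the others. The plain-Python sketch below illustrates that behavior on invented data: one hypothetical comment scores above 70 on two tones and so shows up in both groups. The authors, text, and scores are placeholders, not actual Tone Analyzer results from this AMA.

```python
from collections import OrderedDict


def group_by_tone(rows, tone_names, threshold=70.0):
    """Map each tone to the comments scoring strictly above threshold."""
    groups = OrderedDict((tone, []) for tone in tone_names)
    for row in rows:
        for tone in tone_names:
            if row[tone] > threshold:
                groups[tone].append(row["author"] + "\n\n" + row["text"])
    return groups


# Hypothetical records with made-up scores.
example_rows = [
    {"author": "user_a", "text": "A pointed question",
     "Analytical": 80.0, "Emotional Range": 75.0, "Agreeableness": 20.0},
    {"author": "user_b", "text": "A friendly answer",
     "Analytical": 35.0, "Emotional Range": 10.0, "Agreeableness": 90.0},
]
example_tones = ["Analytical", "Emotional Range", "Agreeableness"]

groups = group_by_tone(example_rows, example_tones)
for tone, grouped in groups.items():
    print(tone + ": " + str(len(grouped)) + " comment(s)")
```

Here the comment from user_a appears under both Analytical and Emotional Range, while the comment from user_b appears only under Agreeableness.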
Watson Tone Analyzer documentation says: “Tone analysis is less about analyzing how someone else feels, and more about analyzing how you are coming across to others.” So, how did IBMers come across within this AMA?
Comments from IBMers take up most of the Agreeableness (tendency to be compassionate and cooperative toward others) section.
They live there beside some “agreeable” questions from outsiders that come with a wink, like
Is your favorite TV show Halt and Catch Fire? I really want it to be...
That comment also scored high under Extraversion and Emotional Range, maybe for its enthusiasm.
No comments from IBMers appear under Emotional Range. These guys are a bunch of cool cats, perhaps–or just polite and friendly AMA hosts.
Note: No comments scored over 70% on emotions like Joy, Anger, Fear, Disgust, and Sadness. This conversation just didn’t get that heated. Try running another reddit AMA discussion through these same steps to see how results differ.
So, when reddit includes this IBM AMA in their next book, which category will they apply? Comments from non-IBMers may land this AMA in the Provocative or Humorous group. IBMers alone? Courageous, of course. ;-) Or perhaps, Informative, which would put us in good company.
Meanwhile, we’ll keep working hard and aspire to Ingenious.
Now you know how to tweak the Simple Data Pipe to load data from a source you want, like reddit. Once you do so, the Cloudant-Spark Connector makes it easy to perform analysis on your Cloudant JSON. In this example, we used an iPython notebook to help us leverage Watson Tone Analyzer, but you can use the analysis tool of your choice.
When you ran Simple Data Pipe, the reddit AMA landed in Cloudant. From there, it’s a breeze to send data on into dashDB. The dashDB data warehouse is also a great place to run analytics. Stay tuned for my next post, which will show you how to take reddit data, load it into dashDB, and analyze with R (Can’t wait? Watch a video on how these two work together).
Try these AMAs
Launch your Simple Data Pipe app again and return to the Load reddit AMA Data section. In step 7, swap in one of these AMA URLs and check out the results.
- Matei Zaharia, creator of Spark
- Chris Rock
- Tim Berners Lee
- Neil deGrasse Tyson
- Bill Gates
- Louis C. K.
- Amy Poehler
- IBM’s Chef Watson
- Barack Obama