As developer advocates, one of our jobs is to help developers who run into issues with our products. Most developers turn to Stack Overflow when they get stuck (over 11.5 million questions asked to date!). We constantly monitor Stack Overflow for questions related to our products, or to our personal expertise, so we can provide as much assistance to developers as possible. We answer a lot of questions, and it’s important that we track and analyze those questions.

How do we conduct our Stack Overflow analysis? In this post we are going to show you how to extend the Stack Overflow connector to provide real value and solve real problems. We’ll show you how we use it to monitor the products we support, improve our responsiveness, and most importantly help our fellow developers.

Example Stack Overflow questions filtering
With 11,000,000+ questions, getting relevant Stack Overflow insights on a product is a challenge. We’ll show you how we do it, with our open source Simple Data Pipe app.

The Stack Overflow Connector

In this tutorial we showed you how to build a Simple Data Pipe Connector for Stack Overflow. The end result was a connector that allowed users to select one of the top 30 most active tags on Stack Overflow and retrieve the 30 most active questions for that tag. While the tutorial served its purpose as a gentle introduction to Data Pipe Connector development, we really didn’t create a connector that was all that useful.

In this post we will show you how to extend the Stack Overflow connector to move data that will enable us to:

  • Find questions that we need to answer.
  • Find out which of our products are most popular on Stack Overflow.
  • Run statistics to determine response rate, acceptance rate, etc.

The Simple Data Pipe SDK

Reflecting on the Stack Overflow connector we built in the previous tutorial, it’s easy to see where it was lacking:

  • We needed to be able to select less popular and more relevant tags, such as cloudant or apache-spark.
  • We needed to be able to pull more than 30 questions.
  • We needed to pull in the questions and the answers to those questions.

It was obvious we needed to extend our connector. The Simple Data Pipe SDK allows us to extend almost every part of the connector, including:

  • Adding custom properties to the connector configuration.
  • Customizing the user interface for managing and running the connector.
  • Massaging or enhancing the data moved from the connector into Cloudant.

Adding Custom Properties

Every pipe created in the Simple Data Pipe has a corresponding document stored in the pipe_db database in Cloudant. This document contains information about the type of pipe (e.g., stackoverflow) and the configuration specific to that pipe. Here is a sample document stored for a Stack Overflow pipe:

{
  "_id": "fd1ffa968a467f73ce93d2a4720fdec4",
  "_rev": "28-666e0c198da190afafa275230e935a05",
  "connectorId": "stackoverflow",
  "name": "stackoverflow-html-tag",
  "type": "pipe",
  "version": 1,
  "clientId": "6812",
  "clientSecret": "ShxD2WxxxxxxSHxxJExX5x((",
  "oAuth": {
    "accessToken": "(R38xxxxC8WxxxPMN*Sp8Q))"
  },
  "tables": [
    {
      "name": "javascript",
      "label": "javascript"
    },
    {
      "name": "java",
      "label": "java"
    },
    // ...
    {
      "name": "html",
      "label": "html"
    }
  ],
  "selectedTableName": "html",
  "selectedTableId": "html"
}

The Simple Data Pipe SDK allows connector developers to add and access custom properties on this document. Developers can use those properties in code to make decisions on how to retrieve data from the desired data source.

To get the data that we need from Stack Overflow we are going to add three new properties:

  • customTags: A comma-separated list of tags for questions that should be downloaded from Stack Overflow.
  • questionCount: The number of questions to download for each tag.
  • downloadAnswers: A boolean value specifying whether or not to download the answers for all tags.
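Values bound from form fields may be stored as strings, so it can help to normalize the three properties before using them. The following is a minimal sketch, not the connector's actual code: the helper name normalizePipeOptions and the default of 100 questions are assumptions made for illustration.

```javascript
// Hypothetical helper: converts the raw custom properties stored on the
// pipe document into values that are convenient to use in connector code.
function normalizePipeOptions(pipe) {
  // customTags is a comma-separated string, e.g. "apache-spark, cloudant"
  const tags = String(pipe.customTags || '')
    .split(',')
    .map((tag) => tag.trim())
    .filter((tag) => tag.length > 0);

  // questionCount may be stored as a string such as "500".
  // Assumption: fall back to 100 when the value is missing or invalid.
  const parsed = parseInt(pipe.questionCount, 10);
  const questionCount = Number.isInteger(parsed) && parsed > 0 ? parsed : 100;

  // Treat anything other than boolean true as "do not download answers"
  const downloadAnswers = pipe.downloadAnswers === true;

  return { tags, questionCount, downloadAnswers };
}

const options = normalizePipeOptions({
  customTags: 'apache-spark, cloudant,dashdb',
  questionCount: '500',
  downloadAnswers: true
});
console.log(options);
```

Normalizing once, in one place, keeps the rest of the connector free of string parsing.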

Extending the User Interface

In order to provide users the ability to specify custom values for our three new properties (customTags, questionCount, and downloadAnswers) we need to make some changes to the user interface.

We are going to customize the Filter page by adding a text field for users to enter the list of custom tags. This will populate our customTags property. We’ll add a pulldown with a list of paging options to populate our questionCount property. Finally, we’ll add a checkbox that will allow a user to specify whether or not to retrieve the answers. This will set our downloadAnswers property.

We start by copying the pipeDetails.tables.html page from the simple-data-pipe project into the simple-data-pipe-connector-stackoverflow project (from simple-data-pipe/app.templates to simple-data-pipe-connector-stackoverflow/lib/templates). We then add the following HTML:

<div class="form_field" ng-if="selectedPipe.selectedTableId == 'custom'">
    <label for="custom_tags" class="form_label">Custom Tags (comma separated)</label>
    <input type="text" class="input_text" name="customTags" id="custom_tags" required ng-model="selectedPipe.customTags" placeholder="cloudant,apache-spark">
</div>
<div class="form_field">
    <label for="questions_count" class="form_label">Number of Questions per Tag</label>
    <select class="input_select" id="questions_count" name="questionCount" ng-model="selectedPipe.questionCount">
        <option value="100">100</option>
        <option value="200">200</option>
        <option value="500">500</option>
        <option value="1000">1000</option>
    </select>
</div>
<div class="form_field">
    <input type="checkbox" name="downloadAnswers" id="download_answers" ng-model="selectedPipe.downloadAnswers"> Download Answers
</div>

Our new Filter page looks like this:

[Screenshot: the customized Filter page of the Stack Overflow connector]

When a user saves their filter options we can see the three new properties added to the pipe config document in the database:

{
  "_id": "8237fa1bd2ea945cee7f89f71c1fa112",
  "_rev": "98-694fb538c0a10d2245658cb90c5e6c1c",
  "connectorId": "stackoverflow",
  ...
  "customTags": "apache-spark,cloudant,dashdb",
  "questionCount": "500",
  "downloadAnswers": true
}

Now that we have these three properties available to us, we need to use them in our connector code. The changes are too extensive to cover in full in this post, but these properties are easy to access from the pipe object that is passed into many of the connector functions, for example:

this.fetchRecords = function(dataSet, pushRecordFn, done, pipeRunStep, pipeRunStats, pipeRunLog, pipe, pipeRunner) {
    var tags = pipe.customTags;
    var pageSize = pipe.questionCount;
    var downloadAnswers = pipe.downloadAnswers;
    //...
}
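To illustrate how questionCount could drive the download, here is a hedged sketch that plans the paged requests for one tag. It assumes the public Stack Exchange API (version 2.2, the questions endpoint with the tagged, pagesize, page, and filter parameters) and a maximum page size of 100. The helper name buildQuestionPageUrls is hypothetical, and the real connector may structure its requests differently.

```javascript
// Assumption: the Stack Exchange API returns at most 100 items per page.
const MAX_PAGE_SIZE = 100;

// Hypothetical helper: builds one request URL per page needed to collect
// `questionCount` questions for a single tag.
function buildQuestionPageUrls(tag, questionCount) {
  const urls = [];
  const pageSize = Math.min(MAX_PAGE_SIZE, questionCount);
  const pageCount = Math.ceil(questionCount / pageSize);
  for (let page = 1; page <= pageCount; page++) {
    const params = new URLSearchParams({
      site: 'stackoverflow',
      tagged: tag,
      order: 'desc',
      sort: 'activity',
      filter: 'withbody', // built-in filter that includes the question body
      pagesize: String(pageSize),
      page: String(page)
    });
    urls.push('https://api.stackexchange.com/2.2/questions?' + params.toString());
  }
  return urls;
}

// 500 questions at 100 per page means five requests
const urls = buildQuestionPageUrls('cloudant', 500);
console.log(urls.length);
console.log(urls[0]);
```

If questionCount is not a multiple of the page size, the last page returns more items than needed, so the caller would have to trim the combined result.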

The Stack Overflow Question Data Structure

After we update our code to use these properties and run our pipe we can see the questions moved to Cloudant. Here is a sample question:

{
  "_id": "0506c2a366b431fbbdf939f4aae574a3",
  "_rev": "1-27c2ab3a997fa0cd73fd5f3cfe0168f4",
  "tags": [
    "java",
    "nosql",
    "cloudant"
  ],
  "owner": {
    "user_id": 3052176,
    ...
  },
  "is_answered": false,
  "answer_count": 1,
  ...
  "question_id": 29216049,
  "title": "Updating Cloudant database using Java",
  "body": "Was wondering if it possible to write code in Java that will update the entries in my Cloudant database?",
  "answers": [
    {
      "owner": {
        "user_id": 4284412,
        ...
      },
      "is_accepted": false,
      "question_id": 29216049,
      "body": "Yes,  Its possible to write JAVA code to update entries / documents in Cloudant database.  You need to use the java-cloudant driver.  Please have a look at the following project on github."
      ...
    }
  ],
  ...
}

As you can see we are now retrieving and associating answers with questions. We’ve also highlighted a few other important fields:

  • tags: The tags associated to the question.
  • is_answered: A boolean that Stack Overflow sets when a question has an accepted answer or an answer with a positive score. In this post we treat it as an indicator that an answer has been accepted.
  • answer_count: The number of answers to the question.
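The nesting shown in the sample document can be produced by grouping answers by their question_id and attaching each group to its question. This is only a sketch of that idea under the assumption that questions and answers arrive as two separate arrays. The helper name attachAnswers is hypothetical and is not taken from the connector.

```javascript
// Hypothetical helper: returns copies of the questions, each with an
// `answers` array holding the answers that share its question_id.
function attachAnswers(questions, answers) {
  const byQuestionId = new Map();
  for (const answer of answers) {
    if (!byQuestionId.has(answer.question_id)) {
      byQuestionId.set(answer.question_id, []);
    }
    byQuestionId.get(answer.question_id).push(answer);
  }
  return questions.map((question) =>
    Object.assign({}, question, {
      answers: byQuestionId.get(question.question_id) || []
    })
  );
}

const merged = attachAnswers(
  [
    { question_id: 29216049, title: 'Updating Cloudant database using Java' },
    { question_id: 36616897, title: 'Task data locality NO_PREF. When is it used?' }
  ],
  [{ question_id: 29216049, is_accepted: false }]
);
console.log(merged[0].answers.length, merged[1].answers.length);
```

Storing the answers inside the question document means a single Cloudant document holds everything needed to judge whether a question still needs attention.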

In the next section we’ll use these fields to create custom queries to find the data that we need to gain greater insight into our Stack Overflow developer community.

Querying and Analyzing the Stack Overflow Data

We are going to start by creating a new design document in Cloudant that will allow us to aggregate and search our Stack Overflow data. Specifically, we will create views and search indexes to:

  • Get the number of questions for a tag that have or have not been answered.
  • Get the number of questions for a tag that do or do not have an answer accepted by the owner of the question.
  • Get a list of questions for a tag that have no answers.

We rolled up these views and indexes into a single design document:

{
  "_id": "_design/questions",
  "views": {
    "by_tag": {
      "reduce": "_sum",
      "map": "function (doc) {\n  if (doc.tags) {\n    for (var i=0; i<doc.tags.length; i++) {\n      emit(doc.tags[i], 1);\n    }\n  }\n}"
    },
    "by_tag_accepted": {
      "reduce": "_sum",
      "map": "function (doc) {\n  if (doc.is_answered && doc.tags) {\n    for (var i=0; i<doc.tags.length; i++) {\n      emit(doc.tags[i], 1);\n    }\n  }\n}"
    },
    "by_tag_not_accepted": {
      "reduce": "_sum",
      "map": "function (doc) {\n  if (! doc.is_answered && doc.tags) {\n    for (var i=0; i<doc.tags.length; i++) {\n      emit(doc.tags[i], 1);\n    }\n  }\n}"
    },
    "by_tag_answered": {
      "reduce": "_sum",
      "map": "function (doc) {\n  if (doc.answer_count && doc.answer_count > 0 && doc.tags) {\n    for (var i=0; i<doc.tags.length; i++) {\n      emit(doc.tags[i], 1);\n    }\n  }\n}"
    },
    "by_tag_not_answered": {
      "reduce": "_sum",
      "map": "function (doc) {\n  if (! doc.is_answered && doc.answer_count == 0 && doc.tags) {\n    for (var i=0; i<doc.tags.length; i++) {\n      emit(doc.tags[i], 1);\n    }\n  }\n}"
    }
  },
  "language": "javascript",
  "indexes": {
    "by_tag": {
      "analyzer": "standard",
      "index": "function (doc) {\n  if (doc.tags && doc.tags.length > 0) {\n    for (var i=0; i<doc.tags.length; i++) {\n      index(\"tag\", doc.tags[i]);\n    }\n  }\n}"
    },
    "by_tag_answer_status": {
      "analyzer": "standard",
      "index": "function (doc) {\n  index(\"accepted\", doc.is_answered);\n  index(\"answered\", doc.answer_count > 0);\n  if (doc.tags) {\n    for (var i=0; i<doc.tags.length; i++) {\n      index(\"tag\",doc.tags[i]);\n    }\n  }\n}"
    }
  }
}
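Because map functions are stored as strings inside the design document, they are hard to read and test. The sketch below writes the by_tag_answered map function as plain JavaScript and runs it against a few made-up documents. The emit and sumByKey functions here are local stand-ins for what Cloudant provides (emit and the _sum reduce with group=true), and the sample documents are invented for illustration.

```javascript
// Local stand-in for the emit function that Cloudant provides to map functions
const emittedRows = [];
function emit(key, value) {
  emittedRows.push({ key: key, value: value });
}

// Same logic as the by_tag_answered view in the design document
function mapByTagAnswered(doc) {
  if (doc.answer_count && doc.answer_count > 0 && doc.tags) {
    for (var i = 0; i < doc.tags.length; i++) {
      emit(doc.tags[i], 1);
    }
  }
}

// Local stand-in for the _sum reduce when the view is queried with group=true
function sumByKey(rows) {
  const totals = {};
  for (const row of rows) {
    totals[row.key] = (totals[row.key] || 0) + row.value;
  }
  return totals;
}

// Invented sample documents
const sampleDocs = [
  { tags: ['java', 'cloudant'], answer_count: 1 },
  { tags: ['cloudant'], answer_count: 0 },
  { tags: ['cloudant', 'nosql'], answer_count: 2 },
  { answer_count: 3 } // no tags, so nothing is emitted
];

sampleDocs.forEach(mapByTagAnswered);
const totals = sumByKey(emittedRows);
console.log(totals); // { java: 1, cloudant: 2, nosql: 1 }
```

A question with several tags is emitted once per tag, which is why the per-tag counts can add up to more than the number of documents.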

We’ll use the following views to query statistics:

  • questions/by_tag: This will return the total number of questions for a tag.
  • questions/by_tag_answered: This will return the total number of answered questions for a tag.
  • questions/by_tag_not_answered: This will return the total number of questions that have not been answered for a tag.
  • questions/by_tag_accepted: This will return the total number of questions with an accepted answer for a tag.
  • questions/by_tag_not_accepted: This will return the total number of questions without an accepted answer for a tag.

The first thing we are going to look at is the total number of questions for tags apache-spark, cloudant, and dashdb. We’ll do this by querying the questions/by_tag view. For the cloudant tag this query would look something like this:

curl -X GET "https://$USERNAME:$PASSWORD@$USERNAME.cloudant.com/stackoverflow_custom/_design/questions/_view/by_tag?group=true&key=%22cloudant%22"

Example response:

{"rows":[
    {"key":"cloudant","value":476}
]}

There have been 476 questions labeled with the tag cloudant. If we run the same query for apache-spark and dashdb we can see which product is the most popular on Stack Overflow:

Tag            # Questions
apache-spark   12,521
cloudant       476
dashdb         58
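The per-tag view queries differ only in the view name and the key, so the URL can be assembled by a small function. This is a sketch under assumptions: the helper name buildViewUrl and the example.cloudant.com host are placeholders, and credentials are deliberately left out of the URL. The key must be JSON-encoded, which is where the %22 quotes in the query string come from.

```javascript
// Hypothetical helper: builds the URL for a grouped view query on one key.
function buildViewUrl(baseUrl, database, designDoc, view, key) {
  const params = new URLSearchParams({
    group: 'true',
    key: JSON.stringify(key) // view keys are JSON, so strings need quotes
  });
  return (
    baseUrl + '/' + encodeURIComponent(database) +
    '/_design/' + designDoc + '/_view/' + view + '?' + params.toString()
  );
}

// Placeholder host; substitute your own Cloudant account URL
const base = 'https://example.cloudant.com';
for (const tag of ['apache-spark', 'cloudant', 'dashdb']) {
  console.log(buildViewUrl(base, 'stackoverflow_custom', 'questions', 'by_tag', tag));
}
```

The same function can target by_tag_answered or by_tag_accepted by changing the view argument.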

Let’s see how well these products are being supported by querying the questions/by_tag_answered view.

curl -X GET "https://$USERNAME:$PASSWORD@$USERNAME.cloudant.com/stackoverflow_custom/_design/questions/_view/by_tag_answered?group=true&key=%22cloudant%22"

Example response:

{"rows":[
    {"key":"cloudant","value":427}
]}

427 of the 476 questions labeled with the tag cloudant have been answered. We can also use the questions/by_tag_accepted view to find how many questions have an accepted answer. Here are the results for all three of our tags:

Tag            # Questions   # Answered   % Answered   # Accepted   % Accepted
apache-spark   12,521        9,617        76.8         7,376        58.9
cloudant       476           427          89.7         339          71.2
dashdb         58            53           91.4         39           67.2

As you can see, about 90% of questions tagged with cloudant or dashdb have been answered, while more than 23% of questions tagged with apache-spark have gone unanswered. So, let’s see if we can find a few of these questions and start answering them.
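The percentages quoted for each tag follow directly from the three view counts. As a small worked example (the function name responseStats is hypothetical), the calculation can be written as:

```javascript
// Hypothetical helper: derives response statistics from the three view counts.
function responseStats(total, answered, accepted) {
  // Percentage of the total, rounded to one decimal place
  const percent = (count) =>
    total > 0 ? Number(((count / total) * 100).toFixed(1)) : 0;
  return {
    total: total,
    answered: answered,
    accepted: accepted,
    unanswered: total - answered,
    percentAnswered: percent(answered),
    percentAccepted: percent(accepted)
  };
}

// Counts taken from the results reported in this post
console.log(responseStats(12521, 9617, 7376)); // apache-spark
console.log(responseStats(476, 427, 339));     // cloudant
console.log(responseStats(58, 53, 39));        // dashdb
```

For apache-spark this gives 2,904 unanswered questions, roughly 23% of the total.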

In the design document we created the following search indexes:

  • questions/by_tag: This will return all of the questions that have a tag that matches our query.
  • questions/by_tag_answer_status: This will return all of the questions that have a tag that matches our query and that match the answered or accepted status we specify.

We can query the questions/by_tag_answer_status index by passing in the tag and setting the answered field to false, as follows:

curl -X GET "https://$USERNAME:$PASSWORD@$USERNAME.cloudant.com/stackoverflow_custom/_design/questions/_search/by_tag_answer_status?q=tag:%22apache-spark%22+AND+answered:false&include_docs=true&limit=2"

In this example we have limited our search results to two. The result is two questions without answers:

{
   "total_rows":2904,
   ...
   "rows":[
      {
         "id":"f74f323a1c531ef4c5ef6faf3fe2e074",
         "order":[
            3.3885726928710938,
            6
         ],
         "fields":{},
         "doc":{
            "_id":"f74f323a1c531ef4c5ef6faf3fe2e074",
            "_rev":"1-5c6a960c4457a7382cbb0729c0844137",
            "tags":[
               "apache-spark"
            ],
            "owner":{
               "reputation":24,
               "user_id":1935652,
               ...
            },
            "is_answered":false,
            "view_count":3,
            "answer_count":0,
            "score":1,
            "last_activity_date":1460620372,
            "creation_date":1460620372,
            "question_id":36616897,
            "link":"http://stackoverflow.com/questions/36616897/task-data-locality-no-pref-when-is-it-used",
            "title":"Task data locality NO_PREF. When is it used?",
            "body":"According to Spark doc, there are 5 levels of data locality...",
            ...
         }
      },
      {
         "id":"d173ca7647eac111020df96c264137bc",
         "order":[
            3.241180419921875,
            26
         ],
         "fields":{},
         "doc":{
            ...
            "tags":[
               "apache-spark"
            ],
            "owner":{
               "reputation":143,
               "user_id":5245972,
               ...
            },
            "is_answered":false,
            "view_count":12,
            "answer_count":0,
            "score":0,
            "last_activity_date":1460378894,
            "creation_date":1460378894,
            "question_id":36549142,
            "link":"http://stackoverflow.com/questions/36549142/can-i-use-checkpoint-for-spark-in-this-way",
            "title":"Can I use checkpoint for Spark in this way?",
            "body":"The spark doc said about checkpoint...",
            ...
         }
      }
   ]
}

From here we can copy the link for a question, go to the Stack Overflow site, and try to help out another developer in need of assistance.
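Copying links by hand works for a couple of results, but the search response can also be reduced to a short work list. The sketch below assumes the response shape shown in this post (rows with an embedded doc when include_docs=true is set). The helper name toWorkList and the choice to list the newest questions first are illustrative assumptions.

```javascript
// Hypothetical helper: turns a search response into a list of questions
// to visit, newest first.
function toWorkList(searchResponse) {
  return (searchResponse.rows || [])
    .map((row) => row.doc)
    .filter((doc) => doc && doc.link)
    .sort((a, b) => b.creation_date - a.creation_date)
    .map((doc) => ({
      title: doc.title,
      link: doc.link,
      views: doc.view_count
    }));
}

// Trimmed copy of the two results shown in this post
const sampleResponse = {
  total_rows: 2904,
  rows: [
    {
      id: 'd173ca7647eac111020df96c264137bc',
      doc: {
        view_count: 12,
        creation_date: 1460378894,
        link: 'http://stackoverflow.com/questions/36549142/can-i-use-checkpoint-for-spark-in-this-way',
        title: 'Can I use checkpoint for Spark in this way?'
      }
    },
    {
      id: 'f74f323a1c531ef4c5ef6faf3fe2e074',
      doc: {
        view_count: 3,
        creation_date: 1460620372,
        link: 'http://stackoverflow.com/questions/36616897/task-data-locality-no-pref-when-is-it-used',
        title: 'Task data locality NO_PREF. When is it used?'
      }
    }
  ]
};

for (const item of toWorkList(sampleResponse)) {
  console.log(item.title + ' -> ' + item.link);
}
```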

Conclusion and Next Steps

Using the Simple Data Pipe SDK to extend our Stack Overflow connector, we have been able to gain real insights into how we support developers. We did this by extending the user interface of our basic Stack Overflow connector to give us the ability to choose more relevant data to download. We added new properties to our connector config that we were able to access immediately in code and without database schema changes. Finally, we created views and search indexes in Cloudant to retrieve important statistics and unanswered questions quickly and efficiently.

We’ve barely scratched the surface with what we can do with this data. Here are some potential next steps:

  • Create a dashboard for viewing and sharing these statistics.
  • Create an interface for searching previous answers or unanswered questions.
  • Integrate user information to find the users in our group who are answering the most questions, have the highest percentage of accepted answers, etc.

You can access the Stack Overflow connector on GitHub at https://github.com/ibm-cds-labs/simple-data-pipe-connector-stackoverflow.

For more information about the Simple Data Pipe and Simple Data Pipe connectors start here.
