It’s hard to deliver data in a way that’s easy for users to consume. Think of an RSS news feed. You subscribe and start drinking from the firehose, but maybe you’re interested only in technology news or want to quickly see the latest Brexit developments.

Filters and search features are fundamental to any app that delivers data. I’ll show you how to implement these features in a way that leverages some artificial intelligence, so you don’t need to tax your own too heavily. We’ll create an app that takes an RSS news feed, passes it through the Alchemy Language API (which provides text analysis through natural language processing), then saves the enhanced data to a Cloudant database for querying.

My news demo app lives on a simple, static HTML page. It features a hierarchical menu that lists articles by category and a search box that lets users find articles featuring specific terms.

Search and data structure

Presenting a compelling search experience like this in a web or mobile application is a problem of two halves:

  • free-text search – the user supplies a multi-word phrase and the search engine returns the documents that best match the query, e.g. find me the documents that best match “cat gifs”
  • fielded search – the user provides a structured query, e.g. find me documents published in the last month, in the entertainment category, with “cat” in the title
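
To make the difference concrete, here is a minimal in-memory sketch of both halves. The sample articles, their field names and the naive word-count scoring are all assumptions made for illustration; a real search engine ranks matches far more cleverly.

```javascript
// Illustrative data only: these articles and field names are invented.
const articles = [
  { title: 'Cat gifs take over the internet', category: 'entertainment', published: '2016-10-01' },
  { title: 'Cat rescued from tree', category: 'news', published: '2016-10-05' },
  { title: 'Markets rally', category: 'business', published: '2016-10-07' }
];

// Fielded search: every condition is a precise test on a named field.
function fieldedSearch(docs, { category, titleWord, since }) {
  return docs.filter(d =>
    d.category === category &&
    d.title.toLowerCase().includes(titleWord.toLowerCase()) &&
    d.published >= since); // ISO dates compare correctly as strings
}

// Free-text search: score each document by how many query words it
// contains, drop the non-matches, and return the rest best-first.
function freeTextSearch(docs, phrase) {
  const words = phrase.toLowerCase().split(/\s+/).filter(Boolean);
  return docs
    .map(d => {
      const text = d.title.toLowerCase();
      return { doc: d, score: words.filter(w => text.includes(w)).length };
    })
    .filter(r => r.score > 0)
    .sort((a, b) => b.score - a.score)
    .map(r => r.doc);
}
```

Fielded search either matches or it doesn't; free-text search returns everything relevant, ordered by how relevant it looks.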

A document may contain a mixture of structured and unstructured data. Let’s take an RSS feed from a news website. A typical news feed article might look like this:

    <item>
        <title><![CDATA[Is this the world's most oversubscribed school?]]></title>
        <description><![CDATA[A school in India that offers an elite education for poor rural families receives 250,000 applications for 200 places.]]></description>
        <link>http://www.bbc.co.uk/news/business-37372776</link>
        <guid isPermaLink="true">http://www.bbc.co.uk/news/business-37372776</guid>
        <pubDate>Tue, 11 Oct 2016 23:20:23 GMT</pubDate>
        <media:thumbnail width="976" height="549" url="http://c.files.bbci.co.uk/85F1/production/_91298243_img_7116.jpg"/>
    </item>

Some of the data is fielded and factual (the link and pubDate), but some data is unstructured text. There is no indication of a hierarchy of categories but we could use a free-text search algorithm to find best matches. Can we do better? Can we make our unstructured text more structured?

Alchemy – making sense of chaos

The Alchemy API comes from Watson, IBM’s cognitive computing division. The Alchemy Language API parses unstructured data and offers its take on what the data is about: its structure, sentiment, taxonomy, and the entities it refers to (people, places, companies, and so on). If we pass a news article about Indian schools to Alchemy, it:

  • tells us the article is about the country India, with some useful links
  • offers some keywords, like “poor rural families”, “elite education”, “school”, “India”
  • suggests some taxonomy options, like /family and parenting, /education/school

Using this suggested data, we can enhance our bare RSS feed with additional structure and link discrete articles together by geography, similarity and theme.
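
As a rough sketch of what "linking articles together" could mean in practice, the function below groups article ids by the entities they share. It assumes each enhanced document carries a features.entity array whose items have type and text properties; the sample documents are invented.

```javascript
// Group document ids by entity, so that articles mentioning the same
// person, place or company can be linked to one another.
// Assumption: entities look like { type: 'Country', text: 'India' }.
function linkByEntity(docs) {
  const groups = new Map();
  for (const doc of docs) {
    const entities = (doc.features && doc.features.entity) || [];
    for (const e of entities) {
      const key = e.type + ':' + e.text;
      if (!groups.has(key)) {
        groups.set(key, []);
      }
      groups.get(key).push(doc._id);
    }
  }
  return groups;
}

// Invented sample documents, for illustration only.
const sampleDocs = [
  { _id: 'a1', features: { entity: [{ type: 'Country', text: 'India' }] } },
  { _id: 'a2', features: { entity: [{ type: 'Country', text: 'India' }, { type: 'City', text: 'Delhi' }] } },
  { _id: 'a3' } // no features at all: simply ignored
];

console.log(linkByEntity(sampleDocs).get('Country:India'));
```

Any two articles that appear under the same key are candidates for a "related stories" link.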

You can provision your own instance of the Alchemy API service on Bluemix, and try it for free. After you create the service, open it, click Service Credentials and click Add to get your API key, which you’ll need in a minute.

Using Node-RED to fetch RSS feeds

Node-RED is a simple way to prototype and implement data flows visually. You drag functional blocks onto a visual flow editor and draw lines between them. Provision the Node-RED service on Bluemix and try it for free. Cloudant comes along automatically, which is handy, because we want to feed news updates from the BBC into Cloudant, which we can do with this simple flow:

[Figure: flow connecting the BBC RSS feed reader to Cloudant]

The block on the left is an RSS feed reader. I created it by selecting a feedparse input (under Advanced) and configuring it with the BBC RSS URL so it knows where to fetch the data. Next I went to Storage and dragged Cloudant out into the flow.

But we also want Alchemy in the mix, adding its taxonomy and entity data before we save to Cloudant. In the left column, under IBM Watson, find Feature Extract and drag it into the flow. Insert it between the BBC reader and Cloudant. Configure it by entering your Alchemy API key.

[Figure: the same flow with Feature Extract inserted between the BBC reader and Cloudant]

Now data passes through the Alchemy API, which adds taxonomy and entity data before it is saved to Cloudant. Once the flow is running, every new article published by the news site passes through your pipeline, is processed by Alchemy, and is stored in Cloudant, all without your having written a line of code.

Note: If you’re on a Cloudant multi-tenant plan, there’s a possibility that the feed may exceed your “requests per second” quota and you could lose data. To avoid this, add a delay widget to the chain in front of Cloudant and set its rate limit to something like 3 per second.
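
If you are writing code rather than wiring up a flow, the same idea can be sketched by spacing the writes out evenly. This is only an approximation of what a rate limiter does, not a description of how Node-RED implements it; schedule, sendThrottled and the send callback are hypothetical names.

```javascript
// Give each item a delay so that no more than `perSecond` items are
// released in any one second.
function schedule(items, perSecond) {
  const gapMs = 1000 / perSecond;
  return items.map((item, i) => ({ item, delayMs: Math.round(i * gapMs) }));
}

// Release each item to `send` at its scheduled time. `send` stands in
// for whatever writes a document to the database.
function sendThrottled(items, perSecond, send) {
  return Promise.all(
    schedule(items, perSecond).map(({ item, delayMs }) =>
      new Promise(resolve => setTimeout(() => resolve(send(item)), delayMs))
    )
  );
}
```

At 3 per second, four items would be released at roughly 0, 333, 667 and 1000 milliseconds.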

Next, we’ll use Cloudant’s built-in search and indexing features to create views we can query to power a web-based front end.

Tip: If you’re not keen on visual programming, then you can achieve the same thing with regular, text-based code. There are Watson Alchemy SDKs for a number of languages, and with a few lines of code, you can configure an RSS feed parser to write data to Cloudant.
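
Here is a hedged sketch of the last step of such a script: shaping a parsed RSS item and its analysis results into a document, then preparing a write to Cloudant's _bulk_docs endpoint. The field mapping (topic from the link, payload from the description) is an assumption modelled on the documents the flow produces; the function names are hypothetical, and authentication, feed parsing and the Alchemy call itself are left out.

```javascript
// Assumed mapping from an RSS item to a stored document.
function toDocument(item, features) {
  return {
    topic: item.link,
    payload: item.description,
    article: item,
    features: features
  };
}

// Build (but do not send) a bulk write request. POSTing { docs: [...] }
// to /<database>/_bulk_docs saves many documents in one round trip.
function bulkDocsRequest(baseUrl, dbName, docs) {
  return {
    url: `${baseUrl}/${encodeURIComponent(dbName)}/_bulk_docs`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ docs })
    }
  };
}
```

The result can be handed to any HTTP client, for example fetch(request.url, request.options), once credentials have been added.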

Indexing structured data with MapReduce

Let’s look at the structure of the data as it arrives in Cloudant:

{
  _id: "17c1444",
  _rev: "1-f2bb9e31f865df943d6b5d4934f1f844",
  topic: "http://www.bbc.co.uk/news/health-37642587",
  payload: "Some children are born with a fussiness towards food which is hard-wired into their DNA, scientists say - so are parents off the hook?",
  article: { ... },
  features: { ... }
}

The article contains further details from the RSS feed. The features object is the data that Alchemy has added, including:

  • entity – an array of entities (people, places, countries) identified in the article
  • taxonomy – an array of suggested places in a hierarchy of categories

We can use Cloudant MapReduce to extract the data from the document and emit keys and values into an index:


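// extract entities: one index row per entity, keyed as "type:text"
function (doc) {
  if (doc.features && doc.features.entity) {
    // forEach rather than map: we only want the side effect of emit()
    doc.features.entity.forEach(function(e) {
      emit(e.type + ':' + e.text, null);
    });
  }
}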

// extract the taxonomy with the highest score
function (doc) {
  if (doc.features && doc.features.taxonomy) {
    var winner = null;
    var winningscore = 0;
    doc.features.taxonomy.forEach(function(e) {
      // parseFloat guards against scores that arrive as strings
      var score = parseFloat(e.score);
      if (score > winningscore) {
        winningscore = score;
        winner = e.label;
      }
    });
    if (winner) {
      // "/education/school" becomes the array key ["education", "school"]
      emit(winner.slice(1).split('/'), null);
    }
  }
}
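
To see what these functions put into the index without deploying anything, we can run them locally against a sample document with a stand-in for emit. This is a sketch: runMap is a hypothetical helper, the map functions are lightly condensed named copies of the ones above, and the sample document and its scores are invented.

```javascript
// Run a map function against one document and collect what it emits.
// The database normally supplies emit(); here we stub it temporarily.
function runMap(mapFn, doc) {
  const rows = [];
  const previous = globalThis.emit;
  globalThis.emit = (key, value) => rows.push({ key, value });
  try {
    mapFn(doc);
  } finally {
    globalThis.emit = previous;
  }
  return rows;
}

function entityMap(doc) {
  if (doc.features && doc.features.entity) {
    doc.features.entity.forEach(function (e) {
      emit(e.type + ':' + e.text, null);
    });
  }
}

function taxonomyMap(doc) {
  if (doc.features && doc.features.taxonomy) {
    var winner = null;
    var winningscore = 0;
    doc.features.taxonomy.forEach(function (e) {
      var score = parseFloat(e.score);
      if (score > winningscore) {
        winningscore = score;
        winner = e.label;
      }
    });
    if (winner) {
      emit(winner.slice(1).split('/'), null);
    }
  }
}

// Invented sample: the labels come from the article, the scores do not.
const sample = {
  _id: 'example',
  features: {
    entity: [{ type: 'Country', text: 'India' }],
    taxonomy: [
      { label: '/family and parenting', score: 0.4 },
      { label: '/education/school', score: 0.6 }
    ]
  }
};

console.log(runMap(entityMap, sample));
console.log(runMap(taxonomyMap, sample));
```

The entity index gets a string key per entity, while the taxonomy index gets a single array key per document, which is what makes hierarchical grouping possible later.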

With the built-in _count reducer on both indexes, we can fetch documents from anywhere in the taxonomy hierarchy, or documents that mention a given entity. The same indexes serve for aggregation (categories, with a count of the articles in each) and for selection (the list of documents within a category).
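
As a sketch of how those two kinds of query might be expressed, the helper below builds view URLs. The query parameters (group_level, reduce, startkey, endkey, key, include_docs) are standard view options; the host, the database name "news", the design document "news" and the view names "taxonomy" and "entities" are assumptions for illustration.

```javascript
// Build a view query URL. View parameters are JSON-encoded, then
// URL-encoded, which is what JSON.stringify + encodeURIComponent do here.
function viewUrl(baseUrl, db, ddoc, view, params) {
  const query = Object.entries(params)
    .map(([name, value]) => `${name}=${encodeURIComponent(JSON.stringify(value))}`)
    .join('&');
  return `${baseUrl}/${db}/_design/${ddoc}/_view/${view}?${query}`;
}

const base = 'https://example.invalid'; // placeholder host

// Aggregation: article counts for each top-level category.
const countsUrl = viewUrl(base, 'news', 'news', 'taxonomy', { group_level: 1 });

// Selection: every article under "education". {} sorts after any string,
// so this key range covers the whole sub-tree.
const educationUrl = viewUrl(base, 'news', 'news', 'taxonomy', {
  reduce: false,
  include_docs: true,
  startkey: ['education'],
  endkey: ['education', {}]
});

// Selection by entity: every article that mentions India.
const indiaUrl = viewUrl(base, 'news', 'news', 'entities', {
  reduce: false,
  include_docs: true,
  key: 'Country:India'
});
```

Raising group_level to 2 would give counts for the second tier of the hierarchy, and so on down the tree.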

Indexing unstructured data with Cloudant Search

Cloudant can also create free-text search indexes, which are populated with unstructured text, chiefly the article’s title and description fields. We can index the content discovered by Alchemy too! Cloudant Search indexes, like MapReduce indexes, are created by supplying JavaScript functions. Instead of calling an emit function, we call an index function to define which fields are to be indexed or stored:

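function (doc) {
  // skip documents that lack an article; indexing undefined values fails
  if (!doc.article) return;
  if (doc.article.title) {
    index("title", doc.article.title, {store: true});
    index("default", doc.article.title);
  }
  if (doc.article.description) {
    index("default", doc.article.description);
  }
  if (doc.features && doc.features.entity) {
    doc.features.entity.forEach(function(e) {
      index("default", e.text);
    });
  }
}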

The above function puts the article’s title, description and entities in the default index. You can then use the default index to power a best-match search of all of the news items, given a user’s search phrase.
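
A search request against that index might be built like this. It is a sketch under assumptions: the host, database, design document and index names are placeholders, and escapeLucene is a hypothetical helper that neutralises query-syntax characters so a user's phrase is treated as plain words.

```javascript
// Escape characters that have special meaning in Lucene query syntax.
function escapeLucene(text) {
  return text.replace(/[+\-!(){}\[\]^"~*?:\\\/]|&&|\|\|/g, '\\$&');
}

// Build a Cloudant Search URL. With no field name in the query, the
// search runs against the "default" field.
function searchUrl(baseUrl, db, ddoc, indexName, phrase, limit = 10) {
  const q = encodeURIComponent(escapeLucene(phrase));
  return `${baseUrl}/${db}/_design/${ddoc}/_search/${indexName}?q=${q}&limit=${limit}`;
}

const catSearch = searchUrl('https://example.invalid', 'news', 'news', 'articles', 'cat gifs');
```

Because the title was indexed with store: true, each matching row should carry the title with it, so a results list can be drawn without fetching the full documents.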

Here is the news

We can visualise the news data in a web-based user interface using our MapReduce index of the taxonomy to present a hierarchical menu showing the number of articles in each category:

[Figure: hierarchical menu showing the number of articles in each category]
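
One way to turn grouped view rows into such a menu is sketched below. It assumes rows shaped like { key: [...path], value: count }, as a view queried with grouping would return them; buildMenu and renderMenu are hypothetical names and the counts are invented.

```javascript
// Fold rows into a tree in which every node's count includes the
// articles filed beneath it.
function buildMenu(rows) {
  const root = { count: 0, children: {} };
  for (const { key, value } of rows) {
    let node = root;
    node.count += value;
    for (const part of key) {
      if (!node.children[part]) {
        node.children[part] = { count: 0, children: {} };
      }
      node = node.children[part];
      node.count += value;
    }
  }
  return root;
}

// Flatten the tree into indented lines, sorted alphabetically per level.
function renderMenu(node, depth = 0) {
  return Object.keys(node.children).sort().flatMap(name => {
    const child = node.children[name];
    const line = `${'  '.repeat(depth)}${name} (${child.count})`;
    return [line, ...renderMenu(child, depth + 1)];
  });
}

// Invented rows, for illustration only.
const menuRows = [
  { key: ['education', 'school'], value: 3 },
  { key: ['education', 'homework'], value: 1 },
  { key: ['family and parenting'], value: 2 }
];

console.log(renderMenu(buildMenu(menuRows)).join('\n'));
```

In a real page the lines would become nested list elements rather than indented text, but the tree-building step is the same.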

We can use the same index to retrieve lists of articles from anywhere in the hierarchy. Our index of entities can be used to link to and retrieve articles that are about individuals or places. Our free-text search can be used to provide a site search facility.

[Figure: free-text search results]

The news site can be a static, single-page web application. Its dynamic content is fetched by in-page web requests to the Cloudant server (as long as the Cloudant database is publicly readable). You can serve the site from any web server or even host it on GitHub Pages.

This static news demo site is updated every fifteen minutes by Node-RED.

Homework

Now that we have our news data and our index definitions in a Cloudant database, what’s to stop us replicating the data from the server to a client-side database such as PouchDB or Cloudant Sync and reconfiguring our web app to read data from its local store instead? This offline-first approach would allow us to sync the news to our local device and consume it offline, even without a network connection. Try it for yourself!
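
As a starting point, a minimal sketch of the PouchDB route might look like this. The PouchDB constructor is passed in so the sketch does not depend on how you load the library; the local database name "news" and the function name are assumptions.

```javascript
// Pull the remote news database into a local PouchDB database and keep
// it up to date. `live` keeps listening for changes; `retry` resumes
// replication after the network drops.
function startNewsSync(PouchDB, remoteUrl) {
  const local = new PouchDB('news');
  const replication = local.replicate.from(remoteUrl, { live: true, retry: true });
  return { local, replication };
}
```

Once the data is local, the app would read from the local database instead of making web requests. Bear in mind that the free-text index is a server-side Cloudant feature, so offline search would need a client-side substitute.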
