The Text toolkit in Streams allows you to analyze unstructured text in real time by integrating with the Text Analytics component in IBM BigInsights. In Streams 4.1, the Text toolkit has been enhanced to support BigInsights version 4.1. This article shows you how to create a Streams application that uses the latest version of BigInsights to extract insight from streaming text.

Prerequisites

To follow along in this article, you’ll need:

Skills

  • A basic understanding of Streams.
  • To learn more about Streams, see the Streams Quick Start Guide.

Overview

Creating a real-time text analysis application involves the following steps:

  1. Create a BigInsights extractor to look for patterns of interest. You can create extractors using the drag-and-drop web interface called the Information Extraction Web Tool, which was introduced in BigInsights 4.0. Learn more about extractors in the BigInsights documentation.
  2. Export the extractor from the web tool.
  3. Use the Text toolkit to load and run the exported extractor in your Streams application. The TextExtract operator in the Text toolkit loads and executes extractors at runtime. For each incoming tuple, it runs the extractor and produces an output tuple if the extractor finds a match for its pattern in the input data. A minimal sketch of how the operator fits into an application is shown after this list.
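
The following sketch shows how TextExtract typically fits into a Streams graph: a source feeds documents to the operator, and matches flow to a sink. The file name, module directory, view name, and attribute names here are hypothetical placeholders, and the operator namespace in the use statement is an assumption; treat this as an illustration of the flow, not as code from the article.

use com.ibm.streams.text.analytics::TextExtract ;

composite TextAnalysisSketch
{
  graph
    // Read one document per line. "documents.txt" is a hypothetical input file.
    stream<rstring line> Documents = FileSource()
    {
      param
        file   : "documents.txt" ;
        format : line ;
    }

    // Run the exported extractor over every incoming tuple.
    // Output attribute names must match the extractor's output columns;
    // "match" is a hypothetical column name used here for illustration.
    stream<rstring match> Matches = TextExtract(Documents)
    {
      param
        moduleSearchPath : "etc/MyExtractor" ; // directory containing the unpacked AQL modules (hypothetical)
        outputMode       : "multiPort" ;
        outputViews      : "My_Extractor" ;    // name of the extractor (hypothetical)
        inputDoc         : "line" ;            // attribute that carries the document text
    }

    // Print each extracted match
    () as PrintMatches = Custom(Matches)
    {
      logic
        onTuple Matches : printStringLn("Match: " + match) ;
    }
}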

This article demonstrates how to create an extractor in the web tool and then use that extractor in a Streams application.

Changes in Streams 4.1

With the addition of the web interface, extractors no longer have to be written by hand in the AQL language. Consequently, as of Streams 4.1, the Eclipse plugins that were previously used to create extractors in Streams Studio are no longer available. Of course, the Text toolkit still supports extractors created outside of the web tool.

Creating an extractor in the Web tool

As mentioned earlier, the first step is to create an extractor using the web tool.

The web tool is accessible from BigInsights Home:

  • For BigInsights on Bluemix, open BigInsights Home by clicking “Launch” in your Bluemix dashboard.
  • For an on-premises installation of BigInsights, BigInsights Home is typically accessible in your browser at the following URL:
https://<knox_host>:<knox_port>/<knox_gateway_path>/default/BigInsightsWeb/index.html

Click “Text Analytics” in BigInsights Home to launch the web tool.

The following video shows how to use the web tool to create an extractor and then integrate that extractor with Streams.


Summary

As discussed in the video, these are the steps to follow to import an extractor from BigInsights into Streams:

  1. After creating your extractor, export it using the “Export AQL” option in the web tool.
  2. Unpack the generated archive to a directory, and use the moduleSearchPath parameter in the TextExtract operator to point to that directory.
  3. Set the names of the attributes in the output stream of the TextExtract operator to match the names of the output columns in your extractor. (Tip: replace any spaces in column names with an underscore ‘_’.) A short example follows this list.
  4. Set the outputViews parameter to the name of your extractor.
  5. Set the outputMode parameter to “multiPort”.
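
As a hypothetical illustration of step 3, suppose your extractor is named My_Extractor and its output columns are “review text” and “star rating” (names invented for this example); the output stream then declares matching attributes with underscores in place of the spaces:

// Hypothetical extractor "My_Extractor" with output columns "review text" and "star rating";
// spaces in the column names become underscores in the output attribute names.
stream<rstring review_text, rstring star_rating> Reviews = TextExtract(InputStream)
{
  param
    moduleSearchPath : "<path_to_unpacked_extractor>" ;
    outputMode       : "multiPort" ;
    outputViews      : "My_Extractor" ;
}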

The source code for the Streams application shown in the video is available on GitHub.

Further examples

BigInsights includes a starter kit that shows how to use Text Analytics to perform sentiment analysis on data from Twitter.

If you have worked through that example, you can easily integrate it into a Streams application by following the same steps as described in the video and summarized above:

Export AQL:

[Screenshot: exporting the extractor using the “Export AQL” option in the web tool]

Following steps 2-5 should give you something similar to this:

stream<rstring text, rstring target, rstring polarity> ProductSentimentStream =
  TextExtract(InputStream)
{
  param
    moduleSearchPath : "<path_to_unpacked_extractor>" ;
    outputMode       : "multiPort" ;
    outputViews      : "Product_Sentiment" ;
    inputDoc         : "tweet" ;
}

The complete Streams application is also available on GitHub.

Conclusion

The BigInsights Text Analytics web tool allows you to quickly create extractors that analyze unstructured text. The Text toolkit in Streams has been enhanced to easily integrate extractors created in the web tool, so you can analyze unstructured text in real time.

Update: Read part 2 of this article to learn how to add keywords of interest to the extractor without having to restart your application.

Download the sample applications from GitHub.

10 comments on “Real Time Text Analysis with Streams and BigInsights”

  1. Javeria Nadeem May 17, 2018

    Hi,
    Is BigInsights on Bluemix deprecated?

  2. Hi, I made an extractor for my streaming application following this tutorial and tried connecting it to a live Twitter feed using just the “Twitter stream” code in the Twitter Smackdown sample. It worked fine before I added a filter and showed sentiments for all the incoming tweets, but when I included the filter so that it only shows sentiments for the terms I had added to the dictionary, it threw a Processing Read Exception and other errors like “SSL peer shut down incorrectly”. Can someone please help?

    • Natasha DSilva June 01, 2018

      Hi,
      I am glad you were able to use the tutorial and it worked, sorry you are having problems now.
      SSL peer errors could be a Twitter connection problem, but without a more detailed error it is hard to say more. Could you please paste the stack trace? Are you running in the Streaming Analytics service?
      Which filter are you referring to, a filter in the code that connects to Twitter, or that processes the text?
      If you could also paste an example of the relevant operators (twitter and TextExtract) that would be great.

      • Hi,

        -These are the runtime errors that I’m getting in the console of Streams Studio when I launch the application:

        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:com.ibm.streamsx.inet.http.HTTPStreamReader.onReadException:-1] – Processing Read Exception
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – javax.net.ssl.SSLException: SSL peer shut down incorrectly
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – com.ibm.jsse2.a.b(a.java:62)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – com.ibm.jsse2.a.a(a.java:240)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – com.ibm.jsse2.as.a(as.java:702)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – com.ibm.jsse2.as.a(as.java:219)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – com.ibm.jsse2.e.read(e.java:51)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:212)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:177)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:138)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:249)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – java.util.zip.InflaterInputStream.read(InflaterInputStream.java:169)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – java.util.zip.GZIPInputStream.read(GZIPInputStream.java:128)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:323)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:365)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – sun.nio.cs.StreamDecoder.read(StreamDecoder.java:211)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – java.io.InputStreamReader.read(InputStreamReader.java:205)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – java.io.BufferedReader.fill(BufferedReader.java:172)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – java.io.BufferedReader.readLine(BufferedReader.java:335)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – java.io.BufferedReader.readLine(BufferedReader.java:400)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – com.ibm.streamsx.inet.http.HTTPStreamReaderObj.sendRequest(HTTPStreamReaderObj.java:94)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – com.ibm.streamsx.inet.http.HTTPStreamReaderObj.run(HTTPStreamReaderObj.java:125)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – java.lang.Thread.run(Thread.java:785)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – com.ibm.streams.operator.internal.runtime.OperatorThreadFactory$2.run(OperatorThreadFactory.java:137)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:com.ibm.streamsx.inet.http.HTTPStreamReader.onReadException:-1] – Will Retry
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – javax.net.ssl.SSLException: SSL peer shut down incorrectly
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – com.ibm.jsse2.a.b(a.java:62)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – com.ibm.jsse2.a.a(a.java:240)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – com.ibm.jsse2.as.a(as.java:702)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – com.ibm.jsse2.as.a(as.java:219)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – com.ibm.jsse2.e.read(e.java:51)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:212)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:177)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:138)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:249)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – java.util.zip.InflaterInputStream.read(InflaterInputStream.java:169)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – java.util.zip.GZIPInputStream.read(GZIPInputStream.java:128)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:323)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:365)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – sun.nio.cs.StreamDecoder.read(StreamDecoder.java:211)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – java.io.InputStreamReader.read(InputStreamReader.java:205)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – java.io.BufferedReader.fill(BufferedReader.java:172)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – java.io.BufferedReader.readLine(BufferedReader.java:335)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – java.io.BufferedReader.readLine(BufferedReader.java:400)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – com.ibm.streamsx.inet.http.HTTPStreamReaderObj.sendRequest(HTTPStreamReaderObj.java:94)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – com.ibm.streamsx.inet.http.HTTPStreamReaderObj.run(HTTPStreamReaderObj.java:125)
        ERROR #splapptrc,J[0],P[0],StuffTwitterSays M[?:?:0] – java.lang.Thread.run(Thread.java:785)

        -These are the operators to access the live Twitter feed, which are similar to the “Twitter stream” code in the Twitter Smackdown sample, but I had to make a few modifications since I’m not using the Streaming Analytics service:

        stream<rstring tweet_data> StuffTwitterSays = HTTPGetStream()
        {
          param
            url : "https://stream.twitter.com/1.1/statuses/sample.json" ;
            dataAttributeName : "tweet_data" ;
            authenticationType : "oauth" ;
            authenticationFile :
              "/home/streamsadmin/workspace/AirlineSentiment/airlineSentiments/data/twitter.properties" ;
        }

        stream<rstring text> TwitterStreamData = JSONToTuple(StuffTwitterSays)
        {
          param
            jsonStringAttribute : "tweet_data" ;
        }

        -These are the operators for the text extractor, which are similar to the operators in the BigInsightsStarterKitApp.spl sample but with a few modifications according to my requirements:

        stream<rstring tweet> RenamedStream = Functor(TwitterStreamData)
        {
          output
            RenamedStream : tweet = text ; // rename the "text" attribute in the tweet b/c it conflicts with the output from the Sentiment extractor
        }

        stream<rstring text, rstring target, rstring polarity> airlineSentimentStream =
          TextExtract(RenamedStream)
        {
          param
            outputMode : "multiPort" ;
            outputViews : "Airline_Sentiment" ; // name of extractor
            inputDoc : "tweet" ;
            moduleSearchPath : "etc/SentimentExtractorAirline" ;
        }

        () as PrintResults = Custom(airlineSentimentStream)
        {
          logic
            onTuple airlineSentimentStream :
            {
              printStringLn("Sentiment = " + polarity + ", text: " + text + ", target: " + target) ;
            }
        }

        -About the filter that I’m referring to: the sentiment analysis for random tweets works just fine. What I am having trouble with is that when I apply the filter in the sentiment extractor (module 4, step 2 in this link https://ibm-open-platform.ibm.com/biginsights/starterkits/biginsights-starter-kit2-cloud/starterkit2cloud.html#/starterkit) so that it runs sentiment analysis only on tweets that match my dictionary, it returns nothing but these errors. So I am not sure if it is an error due to a lack of connection with Twitter or due to a problem in the extractor that I am using to process the tweets.

        • Natasha DSilva June 04, 2018

          Hi,
          Thanks for clarifying. Looking at the stack trace it is coming from the StuffTwitterSays Java operator, so it seems to be a problem connecting to Twitter. If it is only happening sometimes then it is likely a problem with Twitter that is unrelated to the operators. I don’t think that adding sentiment analysis would cause problems connecting to Twitter.

          You might not be getting results from sentiment analysis if there is no mention of those keywords in the tweets you have.
          To test that the extractor is working properly in Streams, I would run it using the sample data from the BigInsights starter kit. If you get output using that sample data, then you know that the extractor is working properly.

          Another way to determine if the problem is with the extractor or with Twitter is to try testing the extractor using saved Twitter data.
          First, comment out the text extract portion of the application.
          Then, add a FileSink to save the data from Twitter to a file:
          () as SavedRawTweets = FileSink(StuffTwitterSays)
          {
            param
              file : "/tmp/SavedTweets.txt" ;
              format : line ; // use line to write lines as they are
          }

          Recompile and run the application for a few seconds. This will give you a file with Twitter data.
          Next, comment out both the FileSink you just added and the StuffTwitterSays operator, and replace them with a FileSource that reads the data you just saved (if you leave the FileSink in, it will overwrite the file as you are reading from it!):
          // the attribute name should match what the downstream JSONToTuple expects (tweet_data)
          stream<rstring tweet_data> StuffTwitterSays = FileSource()
          {
            param
              format : line ;
              file : "/tmp/SavedTweets.txt" ;
          }
          Then you can uncomment your text extract operators and run the application.
          If you get output using the saved Twitter data, then you know that the extractor is working.

          • Hi, thanks for your guidance. I think, as you suggested, the problem is that there’s no mention of the keywords in the tweets being fetched from the API I’ve provided in the url parameter, which is “https://stream.twitter.com/1.1/statuses/sample.json”. Is there any API you know of that can fetch 100% real-time tweets from Twitter?

          • Natasha DSilva June 07, 2018

            Hi,
            I have only beginner-level experience with the Twitter API. Have you tried searching twitter.com manually to see if there are actually tweets about that topic that aren’t being reported through the API?

            There are better resources in their documentation to help you figure out how to get the data you are looking for.
            https://developer.twitter.com/en/docs/tweets/filter-realtime/overview.html
            https://developer.twitter.com/en/products/tweets
            You might need premium access to retrieve all the tweets.
            Also, from what I understand, if no one is tweeting about that topic right now, you might not get any results using the real-time API. Perhaps you want to try the Search API, which checks up to the last 7 days, instead of real-time status updates.
            For example, to search for “delta airlines”:
            Try “https://api.twitter.com/1.1/search/tweets.json?q=delta%20airlines%20&result_type=mixed” as the URL.
            See https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html
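            For instance, keeping the rest of your HTTPGetStream invocation the same and only changing the url parameter would look something like the sketch below. This is only an illustration of the idea; I haven’t verified that the Search API response format works with the rest of your application.
            // Sketch only: the same HTTPGetStream invocation as before, pointing at the Search API
            stream<rstring tweet_data> StuffTwitterSays = HTTPGetStream()
            {
              param
                url : "https://api.twitter.com/1.1/search/tweets.json?q=delta%20airlines%20&result_type=mixed" ;
                dataAttributeName : "tweet_data" ;
                authenticationType : "oauth" ;
                authenticationFile : "twitter.properties" ; // path to your OAuth credentials file
            }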

  3. Javeria Nadeem July 26, 2018

    Hey Natasha, quick help needed.
    I am running a Python script that writes the fetched streamed tweets to a file, and reading that file with a FileSource operator in SPL.
    I am facing certain issues: 1. The file is not being read while it is open and being updated.
    2. Also, is there a limit to the file size the FileSource operator can read? When my file gets heavy with JSON data it throws an exception, but when I reduce its size in MB it works perfectly fine.
    Please help me out with it.
