In this article, I’m going to give you two simple applications that can serve as starting points for Text Analytics applications on Streams. The first example uses BigInsights Text Analytics to normalize terms. The second shows how to tokenize, both with the simple SPL function and with the more full-featured BigInsights Text Analytics.

BigInsights Text Analytics is a powerful text analysis tool that comes packaged with both Streams and BigInsights. I’m only showing very simple examples of its use in this post. For more information on how to use the many features of BigInsights Text Analytics, see the Knowledge Center.

Normalization with the Text Toolkit

This example was inspired by some social media analysis surrounding the World Cup. The application was tracking player mentions on Twitter on a minute-by-minute basis. Streams provides a good platform for such an application, but we needed some way to identify the players, and the problem is that the same player is not always named in the same way. For example, “Messi scored a goal!”, “Lionel Messi is impossible”, and “the game is Lionel vs Louis” all refer to Lionel Messi of Argentina, but each uses a different string.

We want to map all the mentions of Lionel Messi to one normalized form. You could build something to do this in SPL — say with a regular expression search, followed by a map to convert the player aliases. But that would be hard to maintain if we wanted to add some new nicknames for players.
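
To see why that route gets unwieldy, here is a minimal, hypothetical sketch of the SPL-only approach. It is not part of the sample on GitHub, and it assumes a Documents stream with an rstring attribute named text:

// Hypothetical sketch of the SPL-only approach; every new alias means editing
// this hard-coded map and recompiling.
stream<rstring normalizedName> SplOnlyMentions = Custom(Documents) {
   logic
      state: {
         map<rstring, rstring> aliasMap =
            { "messi" : "Lionel Messi", "lionel" : "Lionel Messi", "louis" : "Louis van Gaal" };
      }
      onTuple Documents: {
         // Lowercase the tweet, split it on whitespace and punctuation,
         // and look each word up in the alias map.
         for (rstring word in tokenize(lower(text), " .,!?", false)) {
            if (word in aliasMap) {
               submit({ normalizedName = aliasMap[word] }, SplOnlyMentions);
            }
         }
      }
}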

So we bring in AQL to do the heavy lifting. As input, we’ll require a table that maps aliases to normalized names:

alias,normalizedName
Messi,Lionel Messi
Lionel,Lionel Messi
Louis,Louis van Gaal

Then the AQL to find all mentions of players in the table and map them to their normalized names looks like this:

-- Using an external table means we can supply the table at job submission time.
create external table AliasTable (alias Text, normalizedName Text) allow_empty false;

create dictionary alias_dict from table AliasTable with entries from alias and case insensitive;

-- Look for aliases in the document
create view foundNamesRaw as 
extract dictionaries 
         alias_dict on D.text as foundName
from Document D;

-- Map the names to their normalized values.  
create view foundNames_alias as
select F.foundName as foundName,
       GetString(N.normalizedName) as normalizedName
from foundNamesRaw F, AliasTable N
where Equals(GetString(ToLowerCase(F.foundName)), GetString(ToLowerCase(N.alias)));

The question that then comes up is how to handle tweets that mention a player twice, e.g., “Lionel Messi is Awesome! Go Messi”. What I chose to do is, for each normalized name that appears in the document, produce the list of aliases that were used, so the above tweet would produce:

[{normalizedName="Lionel Messi",aliases=["Messi","Lionel Messi"]}].

The AQL also looks for the normalizedName directly, so you don’t have to have “Lionel Messi,Lionel Messi” in your alias table.
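
To give a flavor of how that can look, here is a rough AQL sketch; the module on GitHub differs in the details (for example, it also takes care of de-duplicating the alias list), and the dictionary and view names here are illustrative. The final view is called normalizedNames to line up with the SPL output schema shown below.

-- Illustrative sketch only; the GitHub module differs in the details.
-- Also match the normalized names themselves.
create dictionary normalized_dict from table AliasTable with entries from normalizedName and case insensitive;

create view foundNamesDirect as
extract dictionaries
         normalized_dict on D.text as foundName
from Document D;

-- Combine the direct matches with the alias matches.
create view foundNamesAll as
(select GetString(F.foundName) as alias, GetString(F.foundName) as normalizedName from foundNamesDirect F)
union all
(select GetString(A.foundName) as alias, A.normalizedName as normalizedName from foundNames_alias A);

-- One row per normalized name, with the list of aliases that appeared in the document.
create view normalizedNames as
select N.normalizedName as normalizedName, List(N.alias) as aliases
from foundNamesAll N
group by N.normalizedName;

output view normalizedNames;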

The full AQL to do this is on GitHub.

This does generate false positives (for example, a tweet mentioning Lionel Hollins, a US basketball coach, would be identified as mentioning Lionel Messi), but because the data came from World Cup-related queries, it sufficed for the application we had.

Next, I’ll show how we use this in SPL. Since the table is an external table in the AQL module, I can supply that table when I start the operator.

namespace normalize.example;
composite NormalizeMentions {
   param 
       // we do this so the application works in both 3.2 and 4.0
       expression $applicationDir: substring(getThisFileDir(),0,length(getThisFileDir()) -1-length("normalize.example"));
       // the input file name comes in as a submission-time value
       expression<rstring> $inputFile: getSubmissionTimeValue("inputFile");
   graph 
   stream<rstring text> Documents = FileSource() {
      param 
         file: $inputFile;
         format:line;
   }

   // The Text Analytics type of normalizedName is a string.  But the Text Analytics type of the aliases is 
   // actually a Span--that means that in Streams we can represent it either as a string, or as a tuple 
   // with a begin and end saying where in the text it's found.  This operator outputs it as a string.
   stream<list<tuple<rstring normalizedName, list<rstring> aliases>> normalizedNames> Names
      = com.ibm.streams.text.analytics::TextExtract(Documents) {
           param 
              // This is the location of the aql module.  The operator compiles it for you.
              uncompiledModules: $applicationDir+"/etc/findnormalize";
              externalTable: "findnormalize.AliasTable="+$applicationDir+"/"+getSubmissionTimeValue("aliastable");
   }

   () as aliasSink = FileSink(Names) {
      param file: "aliasout.txt";
   }
}

When you submit the job, pass the alias table and the input file either on the command line or via the IDE.

In standalone mode: output/bin/standalone aliastable=etc/aliases.csv inputFile=soccer_in.txt

For distributed mode: streamtool submitjob output/NormalizeMentions.sab -P aliastable=etc/aliases.csv -P inputFile=soccer_in.txt

(For Streams 3.2.1 in distributed mode, use streamtool submitjob output/NormalizeMentions.adl -P aliastable=etc/aliases.csv -P inputFile=soccer_in.txt)

If you run via the IDE, you can set the submission-time values in the launch screen for either mode:
[Screenshot: setting submission-time values in the launch screen]

I’ve discussed this example in terms of World Cup soccer players, but the application is quite general. For example, if you wanted to normalize the character names in Jane Austen’s Pride and Prejudice, you need only change the alias table and the input file. So, for example, in Streams 4.0:
streamtool submitjob output/NormalizeMentions.sab -P aliastable=etc/PrideAndPrejudiceAliases.csv -P inputFile=$STREAMS_INSTALL/samples/com.ibm.streams.text/splgen_externdict/data/PrideAndPrejudice.txt

Tokenizing from SPL

Sometimes you want to break down a message into the tokens that make up the message. For example, if you want to do topic detection or sentiment analysis using machine learning over a bag of words, then your first step is to break down the input document into words. SPL includes a family of tokenize functions that make this easy to do from SPL:

stream<list<rstring> tokenList> Tokenized = Functor(Documents) {
    output Tokenized:
    // use space " " as delimiter to tokenize
    tokenList = tokenize(inputDoc," ",false);
}

But punctuation can make that approach behave in unexpected ways. Consider the opening of The Iliad:

Sing, O goddess, the anger of Achilles son of Peleus, that brought countless ills upon the Achaeans.
Many a brave soul did it send hurrying down to Hades, and many a hero did it yield a prey to dogs
and vultures, for so were the counsels of Jove fulfilled from the day on which the son of Atreus,
king of men, and great Achilles, first fell out with one another.

The tokenize function uses space as the delimiter to tokenize the strings. You might expect to find the words Peleus and Achaeans in the token list, but you won’t; instead you’ll find “Peleus,” and “Achaeans.” (note the comma on the first and the period on the second). To get what you expect, you will need to use a bigger set of delimiters that includes punctuation, as we do below:

stream<list<rstring> tokenList> Tokenized = Functor(Documents) {
    output Tokenized:
        // add ".", "," and "\r" as delimiters
        tokenList = tokenize(inputDoc, "., \r", false);
}

Then we’ll see both Peleus and Achaeans (no punctuation) in the output. Note that if, instead of a list of tokens per line, you want to output one token per output tuple, you can invoke SPL’s tokenize from a Custom as follows:

stream<rstring token> TokenStream = Custom(Documents) {
    logic onTuple Documents: {
        // add ".", "," and "\r" as delimiters
        list<rstring> tokens = tokenize(inputDoc, "., \r", false);
        for (rstring myToken in tokens) {
            submit({token=myToken},TokenStream);
        }
    }
}

Tokenizing with the Text Toolkit
The above examples give a simple and fast tokenization, but that isn’t always enough. When you need more sophistication than SPL’s tokenize can provide, you can use BigInsights Text Analytics. BigInsights Text Analytics also includes support for tokenizing in multiple languages. Writers of Text Analytics applications generally make use of the tokenization only indirectly, but we can write a Text Analytics application to do just the tokenization. Let’s also add an optional dictionary of words to ignore. Here’s an example:

module usetokenizer;
-- Let's remove any stop words. Stop words are words that we want to ignore.
create external dictionary StopwordsDict allow_empty true;

-- Create a view of all the tokens that do not match a stop word.
create view AllTokensUnsorted as
extract regex /.*/ on 1 token in D.text as atoken
from Document D
having Not(MatchesDict('StopwordsDict',atoken));

-- Create a view for all tokens found
create view allTokens as
select T.atoken as atoken
from AllTokensUnsorted T
order by T.atoken;

output view allTokens;

Note: This document is written for SPL developers who want to speed their development process by using BigInsights Text Analytics to do some pre-processing. This tokenization pattern can be a useful first step for applications that do the text processing in Streams itself, e.g., counting the number of times certain words appear with certain tags in Twitter messages, keeping track of trending topics, or learning which words are likely to appear together. In contrast, this pattern is not usually useful for people developing sophisticated BigInsights Text Analytics applications: first, with functions like LeftContextTok and FollowsTok and statements like extract pattern, a list of all tokens isn’t usually needed (see the illustrative fragment below), and second, producing one can hurt efficiency.
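
To make the first point concrete, here is a small, purely illustrative AQL fragment; the dictionaries and view names are invented for this example and are not part of the module above. Predicates such as FollowsTok already reason in terms of token distances, so there is no need to materialize a list of every token:

-- Illustrative only: token distances are handled by the predicate itself.
create dictionary player_dict as ('Messi', 'Lionel Messi');
create dictionary goal_dict as ('goal', 'goals', 'scored');

create view PlayerMention as
extract dictionaries player_dict on D.text as mention
from Document D;

create view GoalWord as
extract dictionaries goal_dict on D.text as goalWord
from Document D;

-- Keep player mentions that are followed, within three tokens, by a goal word.
create view PlayerGoal as
select P.mention as mention, G.goalWord as goalWord
from PlayerMention P, GoalWord G
where FollowsTok(P.mention, G.goalWord, 0, 3);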

We can invoke the tokenizer using the following SPL:

namespace tokenize.example;
composite TokenStream { 
   param 
       // we do this so the application works in both 3.2 and 4.0
       expression $applicationDir: substring(getThisFileDir(),0,length(getThisFileDir()) - 1 -length("tokenize.example"));
       // submission-time values; the value names used here are illustrative
       expression<rstring> $inputFile: getSubmissionTimeValue("inputFile");
       expression<rstring> $stopWordsDict: getSubmissionTimeValue("stopWordsDict");
       expression<rstring> $outFile: getSubmissionTimeValue("outFile");

   graph
   stream<rstring text> Documents = FileSource() {
      param 
         file: $inputFile;
         format:line;
   }

   // The output attribute corresponds to the atoken column of the allTokens view.
   stream<rstring atoken> Tokenized = com.ibm.streams.text.analytics::TextExtract(Documents) {
      param uncompiledModules: $applicationDir+"/etc/usetokenizer";
      // The standard tokenizer is faster, but doesn't do the right thing in some cases.
      tokenizer: "multilingual";
      outputMode: "multiPort";
      languageCode: "en";  // For the supported languages, see the Text Analytics documentation
      // This line is optional.  
      externalDictionary: "usetokenizer.StopwordsDict="+$applicationDir+"/"+$stopWordsDict;

   }

   () as sink = FileSink(Tokenized) {
      param file: $outFile;
   }
}

Summary
The AQL modules and example applications using these AQL modules are available on GitHub. The examples work with both Streams 3.2 and Streams 4.0. I hope they help you get started doing text processing with Streams.
