Part 1 of this series introduced the BigInsights Text Analytics web tool and discussed how to integrate it with Streams. It demonstrated how to create an extractor in the BigInsights web tool and create a Streams application that uses the extractor. This article will build on what was discussed therein and highlight some new features in Streams 4.2. Specifically, we will discuss how to update resources used by the TextExtract operator within a running Streams application.

Prerequisites

This article assumes a basic understanding of Streams and BigInsights Text Analytics. It also assumes that you have read part 1 to understand basic concepts.

You will need:

  • Streams 4.2 or later.
  • BigInsights Text Analytics web tool. A standalone version of the BigInsights web tool is included in Streams as of version 4.2. Follow these instructions to get an instance of the web tool running locally.
  • Version 2.7 or greater of the com.ibm.streamsx.inet toolkit to run the sample.

Review

In part 1, we created a BigInsights extractor for a simple pattern: mentions of IBM and similar organizations and their offerings. This extractor was called the Mentions extractor:

[Image: mentions-overview]
The Mentions extractor created in the Text Analytics web tool.

 

We then created a Streams application called ProductSearch that used this extractor to check news articles for occurrences of this pattern.  The application displayed the results on a web page, highlighting the match where it was found:

[Image: mentions-web-page]

Imagine, though, that while the application is running, we would like to update it to also search for references to a newly announced IBM offering or event, such as the World of Watson conference. In earlier versions of Streams, you would have to stop the application, update the extractor with the new keywords, and then restart it.

In Streams 4.2, the TextExtract operator has been enhanced to receive updates while it is running. This article will show you how to extend the application to take advantage of this enhancement.

 

Overview

Let’s review the problem we are trying to solve in a bit more detail. We have an extractor that is run by the TextExtract operator. This extractor contains a dictionary, that is, a list of keywords it uses to detect a specific pattern. We would like to be able to add keywords to one of the dictionaries in the extractor at runtime. This can be done only if the extractor and the operator are both configured to support dynamic updates before the application is launched.

To do this:

1) The dictionary contents must be separated from the extractor and loaded from a file. A dictionary that has been separated in this way is called an external dictionary.

2) The TextExtract operator must be configured to load the dictionary file and to receive updates to it.

We will demonstrate how to implement these steps using the Mentions extractor we created in the first article. While dictionaries are the focus of this article, similar steps apply to updating a table.

Defining the dictionary

Return to the web tool to edit the Mentions extractor we created in part 1.  Recall that it uses two dictionaries, one named Offerings and another named Organizations. We want to change the Offerings dictionary so that we can add keywords to it while it is being used within Streams. We will do this by editing the AQL for the dictionary.

Select the “Offerings” dictionary in the web tool and click “Edit AQL” from the context menu, as shown below. Make sure only the “Offerings” dictionary is selected:

[Image: edit-aql-1]

The AQL editor will open:

[Image: editor-module-name]

Make the following changes:

1) Scroll down to the definition of the Offerings dictionary and change the line:

create dictionary Offerings_dict from file 'Offerings.dict'
with case insensitive;

To:
create external dictionary Offerings_dict required false with case insensitive;

2) You will notice the “AQL Resources” pane below the editor. Edit this pane as follows:

  • “Path” column should be empty.
  • “Resource Name” column should contain the fully qualified name of the dictionary, which is of the form moduleName.dictionary_name. You can find the module name in the first line of the AQL file:

[Image: module-decl]

The dictionary name, as seen in the previous step, is “Offerings_dict”, so in this example the “Resource Name” should be tauser__TextProject__Export.Offerings_dict.

  • In the “Type” column, change the type from Dictionary to Dictionary (External).

Verify that your editor’s contents are similar to this screenshot:

[Image: final-editor-annotated]

3) Click “Save”. You will be prompted to enter a name; call this new dictionary “Offerings_External”.

Your canvas should now look like this:

[Image: canvas-after-edit]

4) You now need to replace the “Offerings” dictionary in the Mentions extractor with the “Offerings_External” dictionary you just created:

 

[Image: edit-replace-with-external-1]

5) Save the new Mentions extractor and export it using the “Export…” feature of the context menu, accepting the default options. A zip file containing the exported extractor will be downloaded.

[Image: export-raw]

 

Configuring the TextExtract operator for dynamic update

Modify the ProductSearch application from part 1 to use the extractor you just downloaded.  If needed, review the instructions in part 1 on how to configure the TextExtract operator to use an extractor exported from the web tool.

Configure the operator to load the dictionary

Since we separated the Offerings dictionary from the extractor, we must now tell the TextExtract operator where to find it, using the externalDictionary parameter. Create a text file containing the initial list of offerings and specify its path using this parameter. It expects a value of the form:

“fully qualified dictionary name = path to dictionary file”.

For example:

  externalDictionary : "tauser__TextProject__Export.Offerings_dict=etc/Offerings_dict_initial.dict" ;

Note: For updates to be processed correctly, there must be a newline at the end of the file.
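
For illustration only (the actual contents shown in the article’s screenshots are not reproduced here), a dictionary file with one entry per line and a trailing newline might look like this:

    Watson Analytics
    Bluemix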

The next step is to configure the operator to accept updates to the Offerings dictionary. To do so, we need a way to send the new keywords to the operator. This is done using a second input port, called the resources port.

Each tuple received on the resources port will have:

  • An rstring attribute named dictionaryName that contains the fully qualified name of the dictionary to be updated.
  • An rstring attribute named dictionaryContents that contains the new entries to add. Multiple entries can be added at once if they are separated by newlines.
  • An attribute named action that specifies whether the new entries should replace the existing dictionary contents or be appended to them.

After sending the updates, a window punctuation must be sent in order for the changes to take effect.

There are a couple of ways of sending updates to the resources port. You could use a DirectoryScan operator paired with a FileSource to monitor a directory for updates. Whenever you wanted to update a dictionary, you would simply add a new file containing the content you want to add to the directory. If additional customization is needed, you could use a Custom operator instead; a sketch of that approach follows. In our application, we’re going to use the DirectoryScan/FileSource combination.
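
For reference, here is a minimal, hypothetical sketch (not part of the original sample) of how a Custom operator could submit such an update. The ACTION type and UPDATE value are the ones used in the sample code below; the keywords and stream name are illustrative only:

    // Hypothetical alternative to DirectoryScan/FileSource: a Custom operator
    // acting as a source that submits a single update to the resources port.
    stream<rstring dictionaryContents, rstring dictionaryName, ACTION action> ManualOfferings = Custom()
    {
        logic
            onProcess :
            {
                // Multiple entries can be sent in one tuple if they are separated by newlines.
                submit({ dictionaryContents = "IBM World of Watson\nIBM Cloud", // illustrative keywords
                         dictionaryName = "tauser__TextProject__Export.Offerings_dict",
                         action = UPDATE }, ManualOfferings);
                // A window punctuation is required for the update to take effect.
                submit(Sys.WindowMarker, ManualOfferings);
            }
    }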

Here’s the updated application graph:

 

[Image: c-new-port-graph]

The highlighted portion of the graph shows the new input for the resources port. Below is the SPL source:

    stream<rstring fileNames> UpdateScan = DirectoryScan()
    {
        param
            directory : "updates" ; // monitor the updates directory for new files
    }

    stream<rstring dictionaryContents, rstring dictionaryName, ACTION action> NewOfferings = FileSource(UpdateScan)
    {
        param
            format : line ; // read the new dictionary entries one line at a time
        output
            NewOfferings : dictionaryName = $offerings_dict_name, action = UPDATE ; // the dictionary name and desired action
    }

The UpdateScan operator scans the “updates” directory for new files and sends each file name to the NewOfferings operator, which in turn reads the file one line at a time. For each line in the file, an output tuple is created whose attributes match those expected by the resources port, as discussed above. Since the FileSource operator sends a window punctuation at the end of each file, this punctuation will trigger an update of the dictionary used by the extractor.

Here is the final configuration of the TextExtract operator:

    stream<MentionsExtractorOutput, InputData> TextExtractOutputStream = TextExtract(Input; NewOfferings)
    {
        param
            moduleSearchPath : "etc/updated_mentions_extractor" ;
            inputDoc : "inputLine" ;
            outputViews : "Mentions" ;
            outputMode : "multiPort" ;
            externalDictionary : $offerings_dict_name + "=etc/Offerings_dict_initial.dict" ;
    }
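
The $offerings_dict_name value used above is a compile-time expression that is not shown in these snippets. As a minimal sketch, assuming it is declared as a parameter of the main composite, the declaration might look like this:

    param
        // hypothetical declaration of the fully qualified dictionary name used above
        expression<rstring> $offerings_dict_name : "tauser__TextProject__Export.Offerings_dict" ;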

 

After making these changes, we are ready to run the application.

 

Running the application

Recall that the application scanned headlines to check for mentions of relevant organizations and their offerings. We’ve modified it to load the list of offerings from a file at startup, and to monitor the “updates” directory for additional offerings.

To demonstrate that we can add additional offerings at runtime, let’s edit our initial Offerings dictionary so that it has only a few entries, and then add some more entries while the application is running. As shown below, the Offerings_dict_initial.dict file contains only two IBM-related keywords:

 

[Image: a-initial]

We have some additional keywords we’d like to add in the new_offerings.txt file:

[Image: additional]

Compile and launch the application, and then go to your browser at http://<pe_host>:9899/textAnalytics.

You should see output like this:

 

[Image: before]

Notice that “IBM World of Watson” is present in the input but is not highlighted.

After copying the “new_offerings.txt” file into the “updates” directory, we can see that the new offerings are now detected:

[Image: after]

Summary

This article has demonstrated how to configure the TextExtract operator to receive updates at runtime. The basic steps that need to be followed are:

  1. Ensure that the dictionary that you would like to update is loaded from a file. This configuration can be done using the web tool.
  2. Create/update your extractor using the dictionary from step 1, and export it when done.
  3. Configure the TextExtract operator to load the extractor (sub-steps 1-4 below are a review from part 1):
    1. Unpack the generated archive to a directory, and use the moduleSearchPath parameter in the TextExtract operator to point to that directory.
    2. Set the names of the attributes in the output stream of the TextExtract operator to the same names as the output columns in your extractor. (Tip: replace spaces in column names with an underscore '_'.)
    3. Set the outputViews parameter to the name of your extractor.
    4. Set the outputMode parameter to "multiPort".
    5. Set the externalDictionary parameter to the location of the initial contents of the external dictionary.  Make sure the file has a trailing newline.
    6. Add a second input port to the TextExtract operator to receive updates.
  4. Determine how the updates will be sent to the operator. Using a Custom operator or a DirectoryScan/FileSource combination was discussed, but you could also receive updates from a messaging server or through HDFS. Ensure a punctuation is sent to trigger an update.

Similar steps can be followed to update a table; for example, use the externalTable parameter to specify the path to the table file. See the documentation for more information.
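
Although the table case is not shown here, the externalTable parameter presumably follows the same “fully qualified name = path” convention as externalDictionary; a hypothetical example (the table and file names are made up) would look something like this:

    externalTable : "tauser__TextProject__Export.Offerings_table=etc/Offerings_table_initial.csv" ;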

 

The sample TextAnalyticsDemo project on GitHub has been updated to include the application shown here.
