The text toolkit can generate Streams code that wraps an InfoSphere BigInsights 2.0 Text Analytics extractor, whether the extractor is provided as source AQL modules or as compiled tam modules. The createTypes.pl script, located in the bin directory of the text toolkit, generates Streams types that match the types of the extractor’s output views and, optionally, a composite that invokes the TextExtract operator.

This is a useful way to create a starting-point Streams application from a BigInsights Text Analytics extractor. The types and the composite can also be generated from a makefile, which reduces the work of keeping a Streams application in sync with a Text Analytics extractor. This article first describes how to create Streams types from the extractor, then how to make a composite, and finally how to make an end-to-end application that you can compile and run.

Run the commands in this article from the toolkit’s bin directory, or from the bin directory in a copy of the toolkit:

$STREAMS_INSTALL/toolkits/com.ibm.streams.text/bin

The most basic use case for the createTypes.pl script is to create the types of the output views of the Text Analytics modules. This example uses the getNames module included in the FeatureDemo sample application from the text toolkit. The getNames module extracts titles followed by full names from text. The output view of the module is FullNameWithTitle, and it is created as follows:

create view FullNameWithTitle as
extract
F.title as title,
regex /[A-Z][a-z]+\s[A-Z][a-z]+/ on F.fullName as fullName
from FullNameWithTitleMessy F;

There are two fields, title and fullName, both of type span (internally, a span is represented as a begin and end offset into a string). As someone working with Streams, you may not be familiar enough with AQL to determine the corresponding Streams types, or the AQL source may be unavailable (if it is provided as a tam module), leaving you in the dark. This is where the createTypes.pl script can help.

Creating Streams types

Let’s assume that you are in the toolkit bin directory, that the FeatureDemo sample has been copied to your home directory, and that you want to build the application in ~/tryAQL. Then, to generate the types, run:

./createTypes.pl --uncompiledModules ~/FeatureDemo/data/getNames --outputDir ~/tryAQL/data

(The outputDir parameter is where the compiled .tam file will go when you run the application with Streams. It need not be part of your Streams application.) This command compiles and inspects the BigInsights Text Analytics module, and then creates a simple Streams file, sample.spl, with the SPL types corresponding to each of the output views. The entire file in this case is:


type toPrint0getNamesType = rstring title, rstring fullName;

By default, createTypes.pl maps spans to rstrings. Since we’re just going to print them out, strings make sense. But if you want to compare distances between mentions or perform other operations in which you need the offset in the text, you may want the output as tuples. To do this, supply the --noconvertspan option, and the generated file is:


type toPrint0getNamesType = tuple title, tuple fullName;

Similarly, the --inttype and --floattype options allow you to specify which Streams int or float type to use.

It’s not generally good practice to have all your Streams files in the default namespace. Manually adding a namespace is a minor inconvenience if you are editing by hand, but it could be difficult if you’re including this command in a makefile that automatically generates the types, so we provide a namespace option. Let’s say you want the types in the namespace myaql, in a file named mytypes.spl (if no file name is supplied, it defaults to sample.spl), and that you want to build the application in the tryAQL subdirectory of your home directory. Make sure the directories exist, and then:


./createTypes.pl --uncompiledModules ~/FeatureDemo/data/getNames --outputDir ~/tryAQL --namespace myaql --outputfile ~/tryAQL/myaql/mytypes.spl
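
The directories named in the command above have to exist before it runs. As a minimal sketch, assuming the layout used in this article (an application directory with one subdirectory per namespace), a small helper can create them. The function name prepare_dirs is hypothetical and is not part of the toolkit.

```shell
# Hypothetical helper (not part of the text toolkit): create the application
# directory and the namespace subdirectory that will hold the generated SPL file.
prepare_dirs() {
    app_dir=$1      # application directory, e.g. "$HOME/tryAQL"
    namespace=$2    # SPL namespace, e.g. myaql
    # -p creates missing parent directories and succeeds if they already exist.
    mkdir -p "$app_dir/$namespace"
}
```

For the example in this article, you would call prepare_dirs "$HOME/tryAQL" myaql once before running createTypes.pl.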

Now you can reference these types (in this case, the type toPrint0getNamesType) in your Streams application. In doing so, your application is insulated from changes in the AQL: adding a field to your output view changes the type definition, but doesn’t affect the rest of your Streams application.
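
Keeping the generated types in sync with the AQL can be automated. The following is a rough shell sketch of the kind of rule a makefile would express: regenerate the types file only when it is missing or when a file in the module directory is newer than it. The function names needs_regen and regen_types are hypothetical; the createTypes.pl options are the ones shown above, and the script is assumed to be run from the toolkit’s bin directory.

```shell
# Hypothetical helpers (not part of the text toolkit).

# Succeeds (returns 0) when the generated types file should be rebuilt.
needs_regen() {
    module_dir=$1    # directory holding the AQL module source
    types_file=$2    # generated SPL file, e.g. ~/tryAQL/myaql/mytypes.spl
    # Rebuild when the types file does not exist yet ...
    [ -f "$types_file" ] || return 0
    # ... or when any file under the module directory is newer than it.
    [ -n "$(find "$module_dir" -type f -newer "$types_file" -print | head -n 1)" ]
}

# Re-runs createTypes.pl only when needed. Run from the toolkit's bin directory.
regen_types() {
    module_dir=$1; types_file=$2; output_dir=$3; namespace=$4
    if needs_regen "$module_dir" "$types_file"; then
        ./createTypes.pl --uncompiledModules "$module_dir" \
            --outputDir "$output_dir" \
            --namespace "$namespace" \
            --outputfile "$types_file"
    fi
}
```

With the paths from this article, the call would be regen_types ~/FeatureDemo/data/getNames ~/tryAQL/myaql/mytypes.spl ~/tryAQL myaql. Comparing modification times is the same test make applies, so the same logic carries over to a makefile rule if you prefer one.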

Creating a composite

It may also be convenient to create a composite that applies the BigInsights Text Analytics module. To create such a composite, add the --makecomposite option, optionally followed by the --compositeName option.


./createTypes.pl --uncompiledModules ~/FeatureDemo/data/getNames --outputDir ~/tryAQL --namespace myaql --outputfile ~/tryAQL/myaql/mytypes.spl --makecomposite --compositeName getNames

For this example, the output file will be:


namespace myaql;
type toPrint0getNamesType = rstring title, rstring fullName;

// a composite for multiTupleMode
public composite getNames(input inputStream;
output toPrint0getNamesStream) {

graph 
    (stream<toPrint0getNamesType> toPrint0getNamesStream) = com.ibm.streams.text.analytics::TextExtract(inputStream) {
        param
            uncompiledModules: "/path/to/FeatureDemo/data/getNames/";
            moduleOutputDir: "/path/to/tryAQL";
            outputMode: "multiPort";
    }
}

Notice that this is the multiPort composite. You can also supply the --singletuplemode argument to build the single tuple mode version. Now you can invoke this composite from your application.

To see an example invocation, supply the --main option, and the script will also create a main composite calling the getNames composite.


./createTypes.pl --uncompiledModules ~/FeatureDemo/data/getNames --outputDir ~/tryAQL --namespace myaql --outputfile ~/tryAQL/myaql/mytypes.spl --makecomposite --compositeName getNames --main

This creates the Main.spl file in the current directory; you’ll have to move it to the right place for your application (in this example, ~/tryAQL). Once it is there, you can compile, run, and then look at the output. To compile it as a standalone application (from your ~/tryAQL directory):


sc -T -M Main -t $STREAMS_INSTALL/toolkits/com.ibm.streams.text

Here’s how you’d run it on Chapter 1 of Sense and Sensibility, included as sample data in the toolkit.


output/bin/standalone inputFile=~/FeatureDemo/data/SenseAndSensibility/chapter1.txt
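
The move, compile, and run steps above can be collected into one script. This is only a sketch under the assumptions of this article: Main.spl is in the current directory, STREAMS_INSTALL is set, and the application lives in the directory you pass in. The names run and build_and_run and the DRY_RUN variable are hypothetical conveniences; with DRY_RUN=1 the commands are printed instead of executed, which is handy for checking paths before a real build.

```shell
# Hypothetical helpers (not part of Streams or the text toolkit).

# Execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Move the generated Main.spl into the application directory, compile it as a
# standalone application, and run it on the given input file.
build_and_run() {
    app_dir=$1       # e.g. "$HOME/tryAQL"
    input_file=$2    # e.g. "$HOME/FeatureDemo/data/SenseAndSensibility/chapter1.txt"
    run mv Main.spl "$app_dir/" &&
    (
        cd "$app_dir" &&
        run sc -T -M Main -t "$STREAMS_INSTALL/toolkits/com.ibm.streams.text" &&
        run output/bin/standalone "inputFile=$input_file"
    )
}
```

Each step is chained with &&, so a failed compile stops the script before it tries to run a stale binary.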

There is similar support for BigInsights 1.4 Text Analytics in the deprecated text toolkit.
