With the Java Operator
RutaText, available from https://github.com/IBMStreams/streamsx.nlp, you can extract data from incoming text. As precondition, you need IBM Streams release 3.2 or newer and previously developed UIMA Ruta rules that are packed into a PEAR.
There are many ways to develop PEAR files. Among them are IBM Watson Content Analytics (ICA) Studio and UIMA Ruta workbench.
One example how to create a UIMA Ruta PEAR is part of the toolkit documentation.
Text extraction is one possible step in the natural language processing pipe. Previous steps might be lemma transformation and removal of stop words. There are separate operators for those steps available in the IBM Streams Natural Language Processing (NLP) Toolkit. This article just concentrates on the¬†
UIMA – Unstructured Information Management Architecture, an Apache open source project.
Apache UIMA implements an extensible framework for analysis of unstructured content such as text, audio and video. Annotator components - also known as Analysis Engines (AE) - are the ones that do the extraction work. The framework also covers development tools and a framework to chain multiple components.
Ruta – is a rule based text annotation language.
Apache UIMA Ruta delivers the analysis engine to interpret the rule language and a workbench (Eclipse plugin) for rules development.
CAS – Common Analysis System, an object-oriented data structure to carry the data for analysis together with types and extracts.
UIMA follows the policy to hold all data as CAS. Using UIMA components means transforming data into CAS first. For output of CAS the XMI format is used.
PEAR – Processing Engine ARchive.
The UIMA framework supports to package an Analysis Engine into a PEAR file. Other components can use it from that PEAR.
XMI – XML Metadata Interchange.
A standard for exchanging metadata information. UIMA delivers tools to display XMI content.
In this example, a PEAR file contains UIMA Ruta rules that annotate groups of words in a document that might mean ‚Äútrouble‚ÄĚ according to a dictionary. Additionally, it calculates a trouble count per document and adds it as document annotation.
RutaText operator applies those rules to incoming text documents. The input file is read line by line. Each line of the input file represents one document to analyze and is processed as input tuple in the
RutaText operator, which output is connected into three different directions:
- To the console.
- As text file.
- As XMI file. This file is the serialized UIMA CAS format output. It is useful as input for further UIMA operators like
RutaCasthat might deliver further extracts.
TextStream is a
FileSource operator that forwards the stream attribute
In the sample text file with line breaks you see the following text:
My phone freezes when using this app. super quality very angry about wifi not connecting and no internet
The UIMA Ruta Rules
The Ruta PEAR contains a dictionary, descriptors, and rules. Potential trouble entries are part of dictionary
TroubleList. The rules annotate matches from the input text to the dictionary as
potentialTrouble.¬†They summarize per line the trouble matches. The references section lists a link to a description how to create a UIMA Ruta PEAR. This description is part of the IBM Streams NLP Toolkit documentation. The following code snippet shows the sample Ruta rules.
The RutaText operator using the Ruta Rules
RutaText operator installs the PEAR at runtime and interprets the rules running the UIMA Ruta Analysis Engine. The only required parameter is
pearFile. If the input contains multiple attributes, you need to give the attribute that is subject to
RutaText analysis in the
inputDoc parameter. As UIMA Ruta rules generate a huge amount of implicit annotations by design, optional you can filter that ones you are interested in – with the
outputTypes¬† parameter.¬† The parameter
outputAttributes defines names for the non-XMI output attributes. The parameter
casOut causes the
RutaText operator to serialize CAS to XMI format as output. The value of parameter
casOut is used as name for the XMI output. It automatically adds the extension
.xmi to this name. If you skip the
casOut parameter, you would not get this output format created.
AutoSpan type covers attributes of both types the Ruta rules from the sample PEAR annotate. Depending on the
typeDescription value only parts of the attributes are valid. For the type
uima.ruta.sample1.Main.TroubleDocumentAnnotation only the attributes
potentialTroubleCount are valid. If the type value is
uima.ruta.sample1.Main.potentialTrouble, then the attributes
end are used.
RutaText operator produces output in two attributes:
xmi. Every attribute contains the full extract result in a different form for different subsequent usage. The sample demonstrates usage of both. In practice, if you only need one, you can skip the other for better performance.
Attributes that are present in input and output stream with the same name and type are automatically forwarded. In the sample that is attribute
The custom operator
AnnoDump produces the console output that is shown based on the annotations output attribute. It contains annotations that are extracted from the input text. Lines 1 and 3 contained potential trouble. The extracted trouble text annotations follow together with their begin and end positions relative to the affected line. For line 2, no potential trouble is found.
 trouble=true, potentialTroubleCount=1 [text=phone freezes, begin=3, end=16]  trouble=true, potentialTroubleCount=2 [text=wifi not connecting, begin=17, end=36][text=no internet, begin=41, end=52]
SinkRuta produces file
out.txt in a comma-separated value (CSV) format from the
SinkXMI produces file
out.xmi in XMI format from the
xmi output attribute. UIMA brings separate tools for XMI format display – offline and integrated with Eclipse via UIMA plugin. The following figure shows how the
out.xmi display looks like in the UIMA Eclipse plugin.
The RutaTextCloudSample shows the RutaTextSample from this article that is enhanced to run in the cloud. All sample input files and Ruta rules are included in the
.sab. The data directory needs to be in
com.ibm.streamsx.nlp::RutaText operator you get an alternative to the IBM Streams AQL
com.ibm.streams.text.analytics::TextExtract operator. Instead of AQL rules it uses UIMA Ruta rules packaged in a PEAR. It can produce various output formats. The same format as AQL
com.ibm.streams.text.analytics::TextExtract (if needed for replacement) as well as standardized XMI.
IBM Streams NLP Toolkit
RutaText operator documentation
How to create a Ruta PEAR?
Apache UIMA Ruta Workbench
Apache UIMA Ruta (Rule based text annotation)
IBM Watson Content Analytics (ICA) Studio
Many thanks to Mark-Oliver Heger for implementation of the sample and review of this article.