This is the first article in a series about the Adaptive Parser project for IBM Streams.

The Adaptive Parser project contains a set of toolkits that aim to parse structured, semi-structured and, most notably, unstructured data streams. It can parse hierarchical data structures and lets you customize the parser’s behavior at any attribute level.

It includes the AdaptiveParser operator in the base toolkit, plus additional toolkits for parsing other data formats.

This article covers the first steps of getting started with the AdaptiveParser operator.
It assumes that the reader already has basic knowledge of developing IBM Streams applications. If you are not familiar with Streams, learn more using the Quick Start Guide.

When should you use the Adaptive Parser toolkit?

The Adaptive Parser toolkit was built from the ground up as a dedicated solution for parsing data. It lets you treat parsing as a first-class stage within a stream, fully decoupled from ingestion. Once the parse stage is decoupled from ingestion, “smarter” parsing techniques can be applied, such as automatic parser generation from a given tuple. Why is this so important?

In most IBM Streams related projects the developer has to achieve three main goals when handling an input:

  1. Ingest data from various sources (such as files, cloud, network data, sensors)
  2. Parse the data being ingested based on schema requirements
  3. Transform the data into a friendlier format to make further processing easier

Some of these steps can be combined into a single step. For example, the FileSource operator both ingests data from a file and parses it according to the indicated format. Combining steps is usually the right choice for demos and pilots, but the best practice is to manage these three tasks separately.

Separating parsing from ingestion brings the following benefits:

  1. The “parsing” stage becomes “data source agnostic”, since it is not tightly coupled with the “ingest” part
  2. Various optimizations can be applied (such as co-location, placement, parallelism)
  3. The parser is chosen because it is the “best tool for the job”, not because of the limited parsing capabilities of the ingestion operator
  4. Single framework for all parsing requirements
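The separation described above can be illustrated outside of SPL. The following is a minimal Python sketch, not Streams code; all function names in it are hypothetical and chosen only for illustration. It shows one parse step being reused unchanged behind two different ingest steps.

```python
from typing import Dict, Iterable, Iterator, List


def ingest_from_text(text: str) -> Iterator[str]:
    """Ingest step: yield raw lines from a text block (stands in for a file)."""
    yield from text.splitlines()


def ingest_from_list(messages: List[str]) -> Iterator[str]:
    """Ingest step: yield raw records from a list (stands in for a network feed)."""
    yield from messages


def parse(records: Iterable[str]) -> Iterator[List[str]]:
    """Parse step: splits each record and knows nothing about its source."""
    for record in records:
        yield record.split()


def transform(rows: Iterable[List[str]]) -> Iterator[Dict[str, object]]:
    """Transform step: reshape parsed values into a friendlier format."""
    for name, value in rows:
        yield {"name": name, "value": int(value)}


# The same parse and transform steps serve both sources.
from_file = list(transform(parse(ingest_from_text("a 1\nb 2"))))
from_feed = list(transform(parse(ingest_from_list(["c 3"]))))
print(from_file)
print(from_feed)
```

Because `parse` only sees an iterable of strings, swapping the source requires no change to the parsing logic, which is the "data source agnostic" property listed above.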

This article will show how to use the AdaptiveParser operator for:

  • Basic parsing of streaming data
  • Parsing streaming data with a delimiter
  • Parsing streaming data with a skipper
  • Parsing streaming data combining a delimiter with a skipper

 

Basic parsing of streaming data

The following example shows how to use the AdaptiveParser operator in its simplest form. By default, it uses whitespace as its delimiter.

In the next sections we will provide more details about configuring various delimiters and skippers.

This first example shows the parser without any parameters:

Image 1: Basic sample of parsing data with AdaptiveParser.
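To make the default behavior concrete, here is a small Python model of whitespace-separated parsing driven by an output schema. It is only a conceptual sketch and not how the AdaptiveParser operator is implemented; the `parse_basic` function, the `Schema` alias and the sample schema are assumptions made for this illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical output schema: ordered (attribute name, converter) pairs.
Schema = List[Tuple[str, Callable[[str], Any]]]


def parse_basic(line: str, schema: Schema) -> Dict[str, Any]:
    """Split on runs of whitespace and convert each value per the schema."""
    tokens = line.split()  # split() with no argument treats any whitespace run as a separator
    if len(tokens) != len(schema):
        raise ValueError(f"expected {len(schema)} values, got {len(tokens)}")
    return {name: convert(token) for (name, convert), token in zip(schema, tokens)}


schema: Schema = [("id", int), ("name", str), ("score", float)]
result = parse_basic("42  alice\t3.5", schema)
print(result)  # {'id': 42, 'name': 'alice', 'score': 3.5}
```

Spaces and tabs are treated alike here, which mirrors the whitespace default described above.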

Parsing streaming data with a global delimiter

The following example shows parsing with the AdaptiveParser operator, using a delimiter to separate values.

What does “global” stand for?

Later articles will explain this feature in detail. In brief, the directive is called “global” because it supports hierarchical structures (e.g. JSON format), where nested attributes inherit it.

The global delimiter must be of type “rstring”. Let’s take a closer look at the next example:


stream<parsed_type> ParsedStream = AdaptiveParser(BasicStream) {
  param
    globalDelimiter: ",";
}

Image 2: Basic sample of parsing data using AdaptiveParser with a global delimiter.

The example above demonstrates parsing comma-separated values using a delimiter directive.

Parsing streaming data with a global skipper

The following example shows parsing with the AdaptiveParser operator, using a skipper to separate values.

The global skipper must be one of the following values:

  • none: skipper is disabled – all input characters are parsed as values
  • blank: skips spaces and tabs
  • control: skips control characters
  • endl: skips new lines
  • punct: skips punctuation symbols
  • tab: skips tabs
  • whitespace: skips all whitespace (default)

Let’s take a closer look at the next example:


stream<parsed_type> ParsedStream = AdaptiveParser(BasicStream) {
  param
    globalSkipper: tab;
}

Image 3: Basic sample of parsing data using AdaptiveParser with a global skipper.

The example above demonstrates parsing tab-separated values using a skipper directive. Because only tabs are skipped, other whitespace characters are valid and remain part of the values.

Note that “Hello world” was parsed as a single value: the space between the two words is not a tab, so it is kept.
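The effect of choosing a skipper can be modeled in a few lines of Python. This is a simplified sketch and not the operator's implementation: the `SKIPPERS` table and the `split_on_skipper` function are hypothetical, and the exact character sets behind `control`, `endl` and `punct` are assumptions (ASCII only).

```python
import string
from typing import List

# Assumed ASCII character sets for each skipper name listed above.
SKIPPERS = {
    "none": "",
    "blank": " \t",
    "control": "".join(chr(c) for c in range(32)) + chr(127),
    "endl": "\r\n",
    "punct": string.punctuation,
    "tab": "\t",
    "whitespace": string.whitespace,
}


def split_on_skipper(text: str, skipper: str = "whitespace") -> List[str]:
    """Return the values left after skipping every character in the skipper set."""
    skip = set(SKIPPERS[skipper])
    values: List[str] = []
    current: List[str] = []
    for ch in text:
        if ch in skip:
            if current:  # a skipped character closes the value in progress
                values.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        values.append("".join(current))
    return values


line = "Hello world\t42\tok"
print(split_on_skipper(line, "tab"))         # ['Hello world', '42', 'ok']
print(split_on_skipper(line, "whitespace"))  # ['Hello', 'world', '42', 'ok']
```

With the `tab` skipper the space survives inside the first value, while the `whitespace` default would break it into two values.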

Putting it all together

stream<parsed_type> ParsedStream = AdaptiveParser(BasicStream) {
  param
    globalDelimiter: ",";
    globalSkipper: blank; // tabs and spaces are both skipped
}

Image 4: Putting it all together

The example above demonstrates a typical use case: parsing data formatted as comma-separated values (“CSV”).

Note that in this case the skipper only complements the delimiter by skipping unwanted whitespace characters.
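The way a delimiter and a skipper complement each other can be sketched in Python as follows. This is a conceptual model only; `parse_delimited` is a hypothetical helper, and it assumes that skipped characters are removed only around each value, leaving characters inside a value untouched.

```python
from typing import List


def parse_delimited(line: str, delimiter: str = ",", skip_chars: str = " \t") -> List[str]:
    """Split on the delimiter, then drop skipped characters around each value."""
    return [field.strip(skip_chars) for field in line.split(delimiter)]


values = parse_delimited("alice ,\t42,  3.5")
print(values)  # ['alice', '42', '3.5']
```

The delimiter decides where one value ends and the next begins; the skipper only cleans up the blanks left around them.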

Summary

This article has demonstrated only the most basic functionality of the AdaptiveParser toolkit, and to the inquiring mind it may seem that the same results could be achieved with basic SPL coding. Don’t give up on it just yet: the next few articles will introduce the full functionality of the AdaptiveParser framework.

View the project on GitHub or download a release.

Many thanks to Laser Nahoom-Kabakov for helping bring this article to life.
