As a technologist and researcher working on advanced technology projects for IBM’s Cloud Computing division, Jamie focuses on problems that are found outside the scope of product development but that impact how well products work or how easy they are to use. “There are some amazing capabilities available on the Bluemix platform,” she says, “and...
Rosie Pattern Language
Rosie Pattern Language is a supercharged alternative to regular expressions (regex), matching patterns against any input text. Rosie ships with hundreds of sample patterns for timestamps, network addresses, email addresses, CSV, JSON, and many more.
In 2012, an estimated 2.5 exabytes (2.5 billion GB) of data were generated every day. With more devices and applications than ever, the amount is surely higher today. It is likely that only 0.5% of this data is ever analyzed.
Much of the data being generated is semi-structured data, which includes images and video, but also data residing in CSV, XML, HTML, or JSON formats. This type of data is not directly amenable to analysis, despite the advanced analytics tools that are created every day, including Spark for machine learning and Watson for cognitive computing.
Data science is the art and science of analyzing data. Around 80% of the time in a typical data science project is spent preparing the data for analysis. This preparation includes extracting information from semi-structured data using pattern matching and parsing. It also includes normalization, annotation, correlation, and a variety of transformations.
While large sets of human-written textual data are processed using Natural Language Processing techniques, the vast quantities of semi-structured data are typically mined using statistical analysis and machine learning, both made easier with technologies like the R language and the Spark runtime.
Yet, the tools we use for preparing raw data for analysis (with R, Spark, Watson, or what have you) have changed little since the 1970s. We still rely heavily on regular expressions to extract the information we care about from the large pools and rushing streams of semi-structured data.
There are three main pitfalls of relying too much on regular expressions:
1. These expressions can be difficult to write, and are notoriously difficult to read and maintain.
2. Extracting information from semi-structured data requires a potentially large collection of regular expressions and the ability to compose them in various ways. Consequently, additional tools (and skills) are needed, such as programming languages like Perl or pattern organizers like Grok. These tools are themselves limited in many ways when it comes to big data.
3. Modern libraries include extensions to the classic regular expression technology, and these extensions can require exponential time (in the input size). A big data pipeline will clog up and stall if some pieces of data require tens of seconds instead of tens of microseconds to process.
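The third pitfall can be seen in a short experiment. The sketch below uses Python's standard `re` module, which is a backtracking matcher, and a nested quantifier, `(a+)+$`, a commonly cited worst case for such matchers. The pattern, the input shape, and the input sizes are assumptions chosen for illustration; none of them comes from the Rosie project.

```python
import re
import time

# Nested quantifiers give a backtracking matcher many ways to split a run of "a"s.
NESTED = re.compile(r"(a+)+$")


def try_match(n):
    """Match n copies of 'a' followed by 'b'; the trailing 'b' forces a failure."""
    return NESTED.match("a" * n + "b")


def timed_match(n):
    """Return (elapsed_seconds, match_result) for an input of size n."""
    start = time.perf_counter()
    result = try_match(n)
    return time.perf_counter() - start, result


if __name__ == "__main__":
    # Sizes are kept small on purpose: each extra character roughly doubles the work.
    for n in (10, 14, 18):
        elapsed, result = timed_match(n)
        print(f"n={n:2d} matched={result is not None} seconds={elapsed:.4f}")
```

Running the script with slightly larger sizes shows the elapsed time growing much faster than the input, which is the behavior that stalls a pipeline. Exact timings depend on the machine and the Python version.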
The Rosie Pattern Language (RPL) overcomes the limitations of regular expressions by providing a language that makes it easy to specify the information you want to extract from your data. There are many similarities with regular expressions, so it’s easy to get started. RPL lets you compose simple patterns into complex ones, organize your patterns into packages, and even define transformations to be done on the matched data.
Moreover, RPL is based on Parsing Expression Grammars, which can express recursive structures (like XML and JSON) that regular expressions cannot. And Parsing Expression Grammars can run in linear time in the size of the input data, making them a good choice for processing big data.
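To make the point about recursive structures concrete, here is a minimal sketch of a PEG-style parser written by hand in Python. It is not RPL and not the Rosie engine; the tiny grammar (nested lists of integers) and all function names are assumptions made for illustration. The key feature is that the rule for a value refers to the rule for a list, which refers back to the rule for a value.

```python
DIGITS = "0123456789"


def parse_number(text, pos):
    """number <- [0-9]+ ; returns (value, new_pos) or None."""
    end = pos
    while end < len(text) and text[end] in DIGITS:
        end += 1
    if end == pos:
        return None
    return int(text[pos:end]), end


def parse_list(text, pos):
    """list <- "[" (value ("," value)*)? "]" ; returns (items, new_pos) or None."""
    if pos >= len(text) or text[pos] != "[":
        return None
    pos += 1
    items = []
    first = parse_value(text, pos)
    if first is not None:
        item, pos = first
        items.append(item)
        while pos < len(text) and text[pos] == ",":
            following = parse_value(text, pos + 1)
            if following is None:
                return None  # a comma must be followed by a value
            item, pos = following
            items.append(item)
    if pos < len(text) and text[pos] == "]":
        return items, pos + 1
    return None


def parse_value(text, pos):
    """value <- list / number ; ordered choice, the first alternative that matches wins."""
    result = parse_list(text, pos)
    if result is None:
        result = parse_number(text, pos)
    return result


def parse(text):
    """Parse the whole input; return the nested Python value, or None on failure."""
    result = parse_value(text, 0)
    if result is None or result[1] != len(text):
        return None
    return result[0]


if __name__ == "__main__":
    print(parse("[1,[2,3],[]]"))
```

Because `parse_value` and `parse_list` call each other, the parser accepts nesting to any depth, which a classic regular expression cannot describe.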
The Rosie Pattern Engine is an implementation of an RPL compiler and an RPL runtime environment. Both components are written in the Lua language and use the LPEG package. The engine is a shared object file that can be linked with another application, and there is also a command line interface. It uses RPL patterns to extract information from input data and outputs structured JSON.
This project includes dozens of example patterns, which range from general-purpose patterns that extract timestamps and network addresses to special-purpose patterns that parse a variety of log files. Patterns written in RPL look like programs, but function like regular expressions in the sense that they are used to extract desired information from input text.
In RPL, whitespace is not significant, and there can be comments, too. Therefore, pattern definitions can be formatted for readability and commented for understanding. Also, pattern definitions can refer to other pattern definitions.
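The idea of building patterns out of other named patterns can be sketched in Python. This is only an analogy: it does not use RPL syntax or the Rosie engine, and the helper names (`token`, `literal`, `sequence`) and the shape of the output are hypothetical. Each pattern is a function that takes the text and a position and returns a capture plus the new position, or `None` on failure.

```python
import json
import re


def token(name, regex):
    """A named leaf pattern backed by a small regular expression."""
    compiled = re.compile(regex)

    def match(text, pos):
        m = compiled.match(text, pos)
        if m is None:
            return None
        return {"type": name, "text": m.group(0)}, m.end()

    return match


def literal(s):
    """Match a fixed string without producing a capture."""

    def match(text, pos):
        if text.startswith(s, pos):
            return None, pos + len(s)
        return None

    return match


def sequence(name, *parts):
    """A named pattern composed of other patterns, matched in order."""

    def match(text, pos):
        start = pos
        subs = []
        for part in parts:
            result = part(text, pos)
            if result is None:
                return None
            capture, pos = result
            if capture is not None:
                subs.append(capture)
        return {"type": name, "text": text[start:pos], "subs": subs}, pos

    return match


# Simple patterns are defined once, then reused by name in a larger one.
hour = token("hour", r"[01][0-9]|2[0-3]")
minute = token("minute", r"[0-5][0-9]")
time_of_day = sequence("time", hour, literal(":"), minute)

if __name__ == "__main__":
    capture, end = time_of_day("23:59 backup started", 0)
    print(json.dumps(capture, indent=2))
```

The nested dictionary keeps the names of the sub-patterns that matched, which is one way to picture how composed patterns can yield structured output rather than a flat string.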
Rosie Pattern Engine also includes a read-eval-print loop for developing and debugging patterns interactively.
Why should I contribute?
There are two ways to contribute to this project. You can write patterns in Rosie Pattern Language for parsing specific kinds of files or for extracting particular pieces of information. If you already know (or want to learn) the Lua language, you can write data transformation functions that can be called from the Rosie Pattern Language to normalize or annotate data, for example. Patterns and transformation functions contributed to the project become part of the out-of-the-box functionality and are appreciated!
You can also extend the Rosie Pattern Engine itself. You’ll learn about Parsing Expression Grammars and about the Lua language, as well as how the RPL compiler works.
What technology problem will I help solve?
By writing patterns and transformation functions for a particular type of data, you will make it easy for people to extract useful information from that data source.
By extending the Rosie Pattern Engine itself, you will help make the vast quantities of semi-structured data in the world accessible for analysis. Big data analysis has the potential to optimize energy usage, keep our networks safe, and discover the next breakthrough in medical research, just for starters.
How will Rosie Pattern Language help my business?
If your business uses analytics, RPL will help you reliably extract useful information from semi-structured data so you can analyze it. If you have a product that generates semi-structured data (log files, social media data, and the like), then RPL will help your customers extract from that data the nuggets that are important to them.
If you make heavy use of regular expressions in any context today, RPL is a tool worth considering. The RPL technology represents a step forward to a more advanced pattern-matching technology endowed with features usually seen only in programming languages, such as interactive development, packages, and program composition.
Rosie blog posts
Discover how Rosie Pattern Language brings computer science and linguistics together.
Discover how the Rosie Pattern Language can improve performance for textual pattern matching when there are many patterns, many coders and users, or large amounts of data.
The All Things Open conference in Raleigh on October 26 showcased a host of open source products and vendors. Many big names were there — including Facebook, Microsoft and IBM, of course! developerWorks and developerWorks Open were heavily represented. Participants who came to the dW Open booth were interested in Bluemix and particularly OpenWhisk and...
The Rosie Pattern Language (RPL) provides a unique way to quickly develop custom patterns for matching your data. This blog post shows you how to develop RPL patterns for "comma separated value" (CSV) files, a data format that is very common, despite the fact that it is not standard and that it comes in...
Join Jamie Jennings at All Things Open to learn more about her Rosie Pattern Language project and see how you can contribute.
This is the first of a three-part blog series about developing Rosie Pattern Language (RPL) patterns. Part 2 will develop robust patterns for parsing CSV files, and Part 3 will demonstrate how to automatically generate RPL patterns. Update on 14 Nov, 2016: The command to launch Rosie has changed to bin/rosie as of November 6,...
Learn the specifics of Rosie Pattern Language, which evolves the processing of raw semi-structured data and removes the limitations of regular expressions by providing a language that makes it easy to specify the information you want to extract from your data.