Published June 21, 2018
In 2012, an estimated 2.5 exabytes (2.5 billion GB) of data were being generated every day. With more devices and applications than ever, the amount of data is surely higher today. It is likely that only 0.5% of this data is ever analyzed.
Much of the data being generated is semi-structured data, which includes images and video, but also data residing in CSV, XML, HTML, or JSON formats. This type of data is not directly amenable to analysis, despite the advanced analytics tools that are created every day, including Spark for machine learning and Watson for cognitive computing.
Data science is the art and science of analyzing data. Around 80% of the time in a data science project is spent preparing the data for analysis. This preparation includes extracting information from semi-structured data using pattern matching and parsing. It also includes normalization, annotation, correlation, and a variety of transformations.
While large sets of human-written textual data are processed using Natural Language Processing techniques, the vast quantities of semi-structured data are typically mined using statistical analysis and machine learning, both made easier with technologies like the R language and the Spark runtime.
Yet, the tools we use for preparing raw data for analysis (with R, Spark, Watson, or what have you) have changed little since the 1970s. We still rely heavily on regular expressions to extract the information we care about from the large pools and rushing streams of semi-structured data.
There are three main pitfalls of relying too much on regular expressions:
These expressions can be difficult to write, and are notoriously difficult to read and maintain.
Extracting information from semi-structured data requires a potentially large collection of regular expressions and the ability to compose them in various ways. Consequently, additional tools (and skills) are needed, such as programming languages like Perl or pattern organizers like Grok. And these tools are themselves limited in many ways when it comes to big data.
Modern libraries include extensions to the classic regular expression technology, and these extensions can require exponential time (in the input size). A big data pipeline will clog up and stall if some pieces of data require tens of seconds instead of tens of microseconds to process.
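The third pitfall can be observed with a short experiment. The sketch below is an illustration, not part of Rosie: it uses Python's built-in `re` module, a backtracking engine, and a deliberately pathological pattern with nested quantifiers. The function name `seconds_to_reject` is hypothetical. Exact timings depend on the machine and the Python version, so none are shown here, but on a backtracking engine the time to reject the input typically grows very quickly as characters are added.

```python
import re
import time

# Nested quantifiers: a backtracking engine may try every way of splitting
# a run of "a" characters between the inner and outer "+" before giving up.
PATTERN = re.compile(r"(a+)+$")


def seconds_to_reject(n: int) -> float:
    """Time how long the engine takes to reject n 'a' characters followed by '!'."""
    text = "a" * n + "!"  # the trailing "!" guarantees that no match exists
    start = time.perf_counter()
    result = PATTERN.match(text)
    elapsed = time.perf_counter() - start
    assert result is None  # the input never matches; only the cost varies
    return elapsed


if __name__ == "__main__":
    # Keep n small: each extra character can roughly double the work.
    for n in (10, 14, 18):
        print(f"n={n:2d}  {seconds_to_reject(n):.6f} s")
```

Running the script prints one timing per input length. Raising `n` much further is not recommended, since a single failed match can then take seconds or longer.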
The Rosie Pattern Language (RPL) overcomes the limitations of regular expressions by providing a language that makes it easy to specify the information you want to extract from your data. There are many similarities with regular expressions, so it’s easy to get started. RPL lets you compose simple patterns into complex ones, organize your patterns into packages, and even define transformations to be done on the matched data.
Moreover, RPL is based on Parsing Expression Grammars, which can express recursive structures (like XML and JSON) that regular expressions cannot. And Parsing Expression Grammars can run in linear time in the size of the input data, making them a good choice for processing big data.
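To make the point about recursion concrete, here is a minimal sketch of a Parsing Expression Grammar hand-coded in Python. It is not RPL and not how the Rosie engine is implemented. The toy grammar (nested lists of non-negative integers) and all function names are assumptions chosen for illustration. Each function corresponds to one grammar rule, and `parse_value` uses PEG-style ordered choice: try the first alternative, and only if it fails try the second.

```python
from typing import Any, Optional, Tuple

# A rule returns (parsed value, next position), or None if it does not match.
Result = Optional[Tuple[Any, int]]


def parse_number(text: str, pos: int) -> Result:
    """number <- [0-9]+"""
    end = pos
    while end < len(text) and text[end] in "0123456789":
        end += 1
    return (int(text[pos:end]), end) if end > pos else None


def parse_list(text: str, pos: int) -> Result:
    """list <- '[' (value (',' value)*)? ']'"""
    if pos >= len(text) or text[pos] != "[":
        return None
    pos += 1
    items = []
    first = parse_value(text, pos)  # the rule refers back to 'value': recursion
    if first is not None:
        value, pos = first
        items.append(value)
        while pos < len(text) and text[pos] == ",":
            following = parse_value(text, pos + 1)
            if following is None:
                break  # leave pos at the ',' so the ']' check below fails
            value, pos = following
            items.append(value)
    if pos < len(text) and text[pos] == "]":
        return items, pos + 1
    return None


def parse_value(text: str, pos: int) -> Result:
    """value <- list / number   (ordered choice)"""
    result = parse_list(text, pos)
    if result is None:
        result = parse_number(text, pos)
    return result


def parse(text: str) -> Any:
    """Match the whole input against 'value' or raise ValueError."""
    result = parse_value(text, 0)
    if result is None or result[1] != len(text):
        raise ValueError("input does not match the grammar")
    return result[0]
```

For example, `parse("[1,[2,3],[]]")` returns the nested Python list `[1, [2, 3], []]`, while an unbalanced input such as `"[1,[2]"` raises `ValueError`. Arbitrarily deep nesting of this kind cannot be described by a classic regular expression. This sketch relies on Python's call stack, so extremely deep nesting would hit the interpreter's recursion limit.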
The Rosie Pattern Engine is an implementation of an RPL compiler and an RPL runtime environment. Both components are written in the Lua language and use the LPEG package. The engine is a shared object file that can be linked with another application, and there is also a command line interface. It uses RPL patterns to extract information from input data and outputs structured JSON.
This project includes dozens of example patterns, which range from general-purpose patterns that extract timestamps and network addresses to special-purpose patterns that parse a variety of log files. Patterns written in RPL look like programs, but function like regular expressions in the sense that they are used to extract desired information from input text.
In RPL, whitespace is not significant, and there can be comments, too. Therefore, pattern definitions can be formatted for readability and commented for understanding. Also, pattern definitions can refer to other pattern definitions.
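The same ideas (named building blocks, free-form layout, comments, structured output) can be approximated in Python, which may help readers who have not yet seen RPL. The sketch below is an analogy only: it uses Python's `re` module in verbose mode, not RPL syntax, and the log-line layout and the names `OCTET`, `IPV4`, `TIME`, `LOG_LINE`, and `extract` are all hypothetical.

```python
import json
import re

# Small building blocks, each defined once and reused by name.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"
IPV4 = rf"{OCTET}(?:\.{OCTET}){{3}}"
TIME = r"[0-2][0-9]:[0-5][0-9]:[0-5][0-9]"

# A larger pattern composed from the blocks above. With re.VERBOSE,
# whitespace in the pattern is ignored and '#' starts a comment.
LOG_LINE = re.compile(
    rf"""
    (?P<time>{TIME})      # timestamp such as 12:34:56
    \s+
    (?P<ip>{IPV4})        # client address
    \s+
    (?P<message>.*)       # the rest of the line
    """,
    re.VERBOSE,
)


def extract(line: str) -> str:
    """Return the named parts of a matching line as a JSON object string."""
    match = LOG_LINE.match(line)
    if match is None:
        raise ValueError("line does not match the pattern")
    return json.dumps(match.groupdict())
```

Calling `extract("12:34:56 192.168.0.1 GET /index.html")` yields a JSON object with `time`, `ip`, and `message` fields. Note that the composition here is plain string interpolation, which is exactly the kind of ad hoc tooling the article argues a dedicated pattern language should replace.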
The Rosie Pattern Engine also includes a read-eval-print loop for developing and debugging patterns interactively.
There are two ways to contribute to this project. You can write patterns in Rosie Pattern Language for parsing specific kinds of files or for extracting particular pieces of information. If you already know (or want to learn) the Lua language, you can write data transformation functions that can be called from the Rosie Pattern Language to normalize or annotate data, for example. Patterns and transformation functions contributed to the project become part of the out-of-the-box functionality and are appreciated!
You can also extend the Rosie Pattern Engine itself. You’ll learn about Parsing Expression Grammars and about the Lua language, as well as how the RPL compiler works.
By writing patterns and transformation functions for a particular type of data, you will make it easy for people to extract useful information from that data source.
By extending the Rosie Pattern Engine itself, you will help make the vast quantities of semi-structured data in the world accessible for analysis. Big data analysis has the potential to optimize energy usage, keep our networks safe, and discover the next breakthrough in medical research, just for starters.
If your business uses analytics, RPL will help you reliably extract useful information from semi-structured data so you can analyze it. If you have a product that generates semi-structured data (log files, social media data, and the like), then RPL will help your customers extract from that data the nuggets that are important to them.
If you make heavy use of regular expressions in any context today, RPL is a tool worth considering. The RPL technology represents a step forward to a more advanced pattern-matching technology endowed with features usually seen only in programming languages, such as interactive development, packages, and program composition.