Over 1,100 people attended my Tech Talk on Rosie Pattern Language on 30 November 2016. I’m astonished and gratified by the level of interest in the project, and I want to thank everyone who attended!

The charts from my talk, a transcript, and a video replay are available here.

In the chat session that accompanied the presentation, there was a lot of good discussion and many questions. In this post, I will try to answer the questions that I didn’t get to after the talk, and I’ll recap a few that I did. The questions are sorted into topics, to make it easier to skip around while reading.

If I misunderstood your question or you have a follow-up, please leave a comment on this post.

Technical questions about pattern matching

What is Greedy repetition?

Pattern matching operators that express repetition, like * (which matches zero or more instances), are described as greedy if they try to match as many instances as possible. The opposite term is lazy: a lazy repetition operator tries to match as few instances as possible.

Greedy and lazy describe the order in which possible matches are considered by the matching engine. Greedy operators start with the longest possible match. If the overall pattern fails and the matching engine backtracks, the next shortest match is considered.

Rosie’s repetition operators are all greedy. They are also possessive.

A pattern matching operator is called possessive if the matching engine will not backtrack to consider a different match. Regex operators are usually not possessive. That is, regex engines will backtrack (when the overall pattern fails) to each repetition operator to see if it can match a different number of instances instead. This can cause unexpectedly poor performance. See Rosie and regex: questioning anomalous performance.

Rosie’s repetition operators are greedy and possessive. A pattern like B* will match as many instances of B as possible, and the match will never be revisited.

Even though Rosie is based on Parsing Expression Grammars, not Regular Expressions, RPL has some operators in common with Regular Expressions. These include the repetition operators *, +, ?, and {n,m}. A great resource for more information about such operators can be found here.

Rosie is using the same syntax as regex for several operators, but why has “or” been changed from | (pipe) to / (slash)?

The “or” operator in RPL comes from Parsing Expression Grammars and is best described as an “ordered choice”. When matching a / b, if a is matched successfully, then b is never tried. This behavior is different from the regex pipe operator. The difference in notation, due to Ford, is intended to reinforce the difference in semantics.

Does Rosie offer linear time complexity? And what about the JSON overhead in space complexity?

Rosie’s matching algorithm matches in linear time, with one caveat. The time it takes to match is a linear function of the length of the input data when the recursive RPL grammar construct is not used, and often when it is used. In other words, it is possible to write recursive grammars in RPL which require more than linear time.

Rosie compiles RPL down to a low level intermediate language which is then processed by Ierusalimschy’s brilliant lpeg library. And because Rosie’s grammars are structurally identical to lpeg grammars, the complexity analysis in this paper describes RPL as well as lpeg.

Another important property of Ierusalimschy’s lpeg library is that it requires only constant space. There are other approaches to matching Parsing Expression Grammars which can guarantee linear time matching, but at the cost of linear space. The space required by the lpeg engine depends only on the pattern, not on the input data.

Stepping back to consider RPL as a whole, the space consumed by the JSON-encoded output is at worst quadratic in the input size. The worst case scenario occurs in very unusual patterns in which the pattern writer is interested in capturing a series of ever-small substrings of the input. For example:

bash-3.2$ rosie -repl
This is Rosie v0.99i
Rosie> a = "a"
Rosie> a2 = {a a}
Rosie> a3 = {a2 a}
Rosie> a4 = {a3 a}
Rosie> a5 = {a4 a}
Rosie> .match a5 "aaaaa"
{"a5": 
   {"text": "aaaaa", 
    "pos": 1.0, 
    "subs": 
      [{"a4": 
          {"text": "aaaa", 
           "pos": 1.0, 
           "subs": 
             [{"a3": 
                 {"text": "aaa", 
                  "pos": 1.0, 
                  "subs": 
                    [{"a2": 
                        {"text": "aa", 
                         "pos": 1.0, 
                         "subs": 
                           [{"a": 
                               {"text": "a", 
                                "pos": 1.0}}, 
                            {"a": 
                               {"text": "a", 
                                "pos": 2.0}}]}}, 
                     {"a": 
                        {"text": "a", 
                         "pos": 3.0}}]}}, 
              {"a": 
                 {"text": "a", 
                  "pos": 4.0}}]}}, 
       {"a": 
          {"text": "a", 
           "pos": 5.0}}]}}
Rosie> 

In more typical usage, the pattern writer uses the RPL alias to eliminate unneeded sub-matches from the output:

bash-3.2$ rosie -repl
This is Rosie v0.99i
Rosie> alias a = "a"
Rosie> alias a2 = {a a}
Rosie> alias a3 = {a2 a}
Rosie> alias a4 = {a3 a}
Rosie> a5 = {a4 a}
Rosie> .match a5 "aaaaa"
{"a5": 
   {"text": "aaaaa", 
    "pos": 1.0}}
Rosie> 

Comparing Rosie to other solutions

How would you compare use of Rosie to the ELK stack?

The ELK stack (Elastic Search, Logstash, Kibana) is a comprehensive solution for collecting, storing, and visualizing application logs. When Logstash reads raw log data, any number of Logstash plugins can be used to structure and filter the data. Grok is apparently the plugin most depended upon for extracting information from log records, and Rosie is most appropriately compared to Grok.

Grok has a library of named regex patterns, and Rosie supports a library of named PEG (Parsing Expression Grammar) patterns. Both tools let you write your own patterns or modify the ones they ship with. In our experiments, Rosie has proven to be faster than Grok by a factor of around 4. Rosie’s patterns are also more expressive. And Rosie’s patterns can be organized into modules that do not conflict with one another, making patterns easier to share (and to debug).

It is possible to write a Rosie plugin for Logstash that can be used alongside or instead of Grok. Logstash plugins are fairly straightforward, and I have written a “proof of concept” Rosie plugin. But it would take a more experienced Ruby programmer, and preferably one more familiar with Logstash than I am, to write and package a proper one.

If you are interested in a Rosie plugin for ELK, please contact me! You can leave a comment on this post, or add a comment to this open issue.

Why not simply use lex/yacc instead of Rosie?

I cut my teeth on lex and yacc for parsing. I’m sure some people have less than fond memories of these tools from their undergraduate compiler classes. (My first compiler class used the then-new first edition of the Dragon Book!) Rather than focus on lex and yacc specifically, I’ll answer generally about use cases that call for traditional parser technology and ones that, in my opinion, call for a tool like Rosie.

Traditional parsing technology comes from a large body of research in compilers. I would characterize such parsing tools as:

  • Highly customizable, to address the quirks and complexities of programming language syntax
  • With a strong language affinity, meaning that a particular set of tools (e.g. lex/yacc or pyparsing) were created for use by programs written in a particular language (e.g. C and Python, respectively)
  • Requiring actual programming to use (not just pattern writing)
  • Typically not suited for big data, as they parse without tail recursion and can overflow the stack on large inputs

These are generalities, and there are exceptions. But Rosie was designed for a different set of use cases:

  • RPL cannot express the ambiguous grammars of some programming languages. (Neither can regex, but a custom parser can.)
  • RPL has no particular language affinity, and therefore is equally suited (or not) to solutions written in node.js, Python, Java, Go, or other languages.
  • You don’t have to write code to use Rosie. Because patterns are written in RPL, a Go programmer can share their patterns with a Ruby programmer and neither has to understand the other’s program code.
  • Because Rosie leverages lpeg, Rosie does not consume stack space proportional to the input size, and so will not overflow the stack.

Will Rosie be faster than simple substring matching?

Probably not! Then again, it’s hard to know without testing, and such testing would have many variables. You would have to choose at minimum a programming language, a set of patterns, and a representative data set.

Given a choice for each of these variables, the relative performance of technologies for substring matching is likely to depend largely on the details of the language implementation (which language, and which interpreter or compiler), platform (OS and hardware), and I/O method used.

Questions about Rosie’s capabilities

I’m curious what the IPV6 expression looks like in Rosie?

Because RPL expressions are composable, an IPV6 pattern can be written by combining smaller patterns that match pieces of the IPV6 specification. Also, the final IPV6 pattern can be written with whitespace and comments that make its structure evident. The resulting RPL pattern may be longer or shorter than the equivalent Grok expression (probably longer), but it would be more readable and understandable.

Is there a “look behind” operator?

A “look behind” operator is planned.

Can Rosie work in a parallel way?

Rosie was designed for a multi-threading model in which each thread has its own Rosie matching engine. The engines are independent of one another, and in fact, each engine may be configured to match a different set of patterns from the others.

That said, as of this writing (6 December 2016), no multi-threaded tests have been made. If you are interested in helping us prove this capability, please let us know by leaving a comment here or opening an issue in the Rosie repository.

Are there patterns for XML?

To date, no one has written patterns for XML, although clearly RPL is capable of expressing such patterns. While you could write a parser for XML with Rosie, there are already many XML parsers and document converters available. However, I can envision a use case in which some corpus of data is being processed by Rosie, and within this data is a bit of XML. It would be straightforward to write a pattern in RPL that matches a snippet of XML-like text without fully parsing it. With this, you could extract the pieces of the input data that are in XML for further processing by tools that will use the XML.

Does the current Rosie release support generic database access or datasets?

There is currently no support for accessing databases from the Rosie command line program. Using librosie, however, Rosie can be invoked from within programs in many languages that support database access, like Python, Go, or node.js for example.

Can we do search by proximity with Rosie?

Sure. For example, here is a pattern that finds an “X” within 7 characters of an “A”:

bash-3.2$ rosie -repl
This is Rosie v0.99i
Rosie> .debug off
Debug is off
Rosie> prox = {"X" [^"A"]{0,6} "A"}
Rosie> .match prox "XA"
{"prox": 
   {"text": "XA", 
    "pos": 1.0}}
Rosie> .match prox "X1234A"
{"prox": 
   {"text": "X1234A", 
    "pos": 1.0}}
Rosie> .match prox "X123456A"
{"prox": 
   {"text": "X123456A", 
    "pos": 1.0}}
Rosie> .match prox "X1234567A"
Repl: No match  (turn debug on to show the match evaluation trace)
Rosie>

What about matching multi-line patterns?

There are several ways to match multi-line patterns, depending on your use case.

If you are using librosie, i.e. calling Rosie from a Python or node.js or other program, you can call Rosie’s match API with any input string, even one with embedded newlines. Your pattern can match newlines with, e.g. “\n”. The RPL dollar-sign operator, $, matches only at the end of the input string, not at a newline. You have complete control over whether newlines are significant or not.

The Rosie CLI reads input files one line at a time, unless the -wholefile option is enabled, in which case the entire file is given to Rosie’s match routine as a single input string. This works well in many cases, such as finding patterns in source code files or other documents. Certainly, it could consume large amounts of memory when reading large input files, because the implementation reads the entire input string into memory for processing.

Note: The output encoding options color and nocolor are too slow for large inputs. (They are intended for cases where a human is reading the output, such as during debugging.) The output encodings fulltext and json are best suited for processing large data using the Rosie CLI.

Finally, there are use cases in which multi-line patterns are best handled in a processing step that occurs after Rosie has processed the input one line at a time. Consider the case of an application log file in which most of the lines follow the same log entry format, but in which you may find the occasional Java stacktrace. When it comes to matching the stacktrace, it may appear that a multi-line pattern would be the right thing. However, this hypthetical multi-line pattern would only succeed if the entire trace appeared uninterrupted in the input file. With some logging solutions, this is guaranteed, while with others, it is possible for regular log entries to be interleaved with lines from the trace. Let’s consider both cases.

In both cases, we write a pattern for line 1 of the stack trace, and another for the format of the middle lines, etc. The pattern we use for the entire file is one like this hypothetical one, which is like spark.patterns from rpl/spark.rpl:

log.patterns = log.syslog /                 -- typical syslog entry
	       log.exception_at /           -- e.g. "at org.apache.spark.sql..."
	       log.caused_by /              -- e.g. "Caused by: java.lang.ClassNotFoundException: ..."
	       log.exception_start          -- e.g. "Exception in thread \"Container..."

With a pattern like this, each line of the input file will be labeled in the output with which sub-pattern was matched. In other words, by examining the output, we can see which lines matched log.syslog and which matched log.exception_start, etc.

Case 1: Stacktraces appear in their entirety, with no other lines of data interspersed.

In this case, it is easy for the data processing stage that comes after Rosie to group together all the lines from a stacktrace, if this is desired. The lines will be in order, and they will start with a line labeled log.exception_start, and it is clear that the trace has ended when either a log.syslog or another log.exception_start line is encountered.

Case 2: Stacktrace lines may be interleaved with other lines.

Note that a multi-line pattern cannot succeed in this case because of the interleaving. The only option is to use an alternating pattern like log.patterns and process the input one line at a time, then try to reconstruct the traces after Rosie has processe the raw data. To reconstruct stacktraces in this case will require custom logic reflecting the semantics of the input data. Rosie operates on data syntax, and cannot solve problems that need knowledge of the data semantics.

Running Rosie on various platforms

On which platforms does Rosie run? Are there plans to package Rosie to make it easy to install on popular Linux distros, like Red Hat Enterprise Linux, Ubuntu, etc.?

Rosie is currently distributed in source code form, and is known to build successfully on OS X and several popular Linux distributions (see How to Install). There is a Homebrew formula for OS X, and some talented volunteers have offered to help create packages for Debian and Red Hat Linux distributions.

I am actively seeking a Windows programmer who can help with a port. For some Windows users, it may be useful to know that Rosie is reported to work with the Ubuntu Linux subsystem available for Windows 10 Anniversary Edition. Want to help? Please let us know by opening an issue on Rosie's GitHub page or leaving a comment on this post.

Is this going to run in IBM Bluemix?

We hope to have Rosie available in IBM Bluemix cloud environment soon!

Calling Rosie from various programming languages

Are the capabilities of Rosie being implemented in other languages today?

Yes! There is a library called librosie that can be used with a variety of programming languages. Look at the ffi directory of the Rosie project, or this section of the Rosie README.

Today (6 December 2016) you can find some crude sample programs that show how librosie can be used. In the coming weeks and months, many of these samples will be replaced by proper modules, classes, or interfaces. To do this for language X requires assistance from a programmer with expertise in X. If you have expertise in Perl, R, Java, Ruby, or any language that supports libffi, then we welcome your help in making Rosie available from your language of choice.

What about Ruby?

There is a sample program in Ruby, but it is simple and crude. It is my hope that a Ruby programmer will volunteer to turn this sample program into a proper Ruby client for librosie, and create a gem to make it easy to install. Want to help? Please let us know by opening an issue on Rosie's GitHub page or leaving a comment on this post.

What about Perl?

Perl has support for libffi, and so support for librosie should be easy to write. It would be wonderful if a Perl programmer volunteered to create a librosie interface for Perl programmers. Want to help? Please let us know by opening an issue on Rosie's GitHub page or leaving a comment on this post.

What about Java (and Maven)?

Java supports libffi and we are developing a Java interface to librosie. If you are able to help, either to shape the Java package or to create the Maven package descriptor, please let us know by opening an issue on Rosie's GitHub page.

What about R?

If R has a way to use libffi, then it should be easy to use librosie in R. Perhaps an R programmer can help out here? Start by by opening an issue on Rosie's GitHub page.

Miscellaneous

Why the name Rosie?

You could certainly draw a connection to Rosie the robot maid from the television show The Jetsons, and say that Rosie the pattern matcher "tidies up your data".

But in fact, the name is a reference to Rosie the Riveter. As a cultural icon in the U.S., Rosie came to represent all the women who went to work in factories during World War II, filling the jobs left open by men who were serving in the military. Rosie has since become a symbol of women's participation in the economy. We can draw an analogy to the technical sector of today's economy: women can and will work in technology, despite the historical domination of the field by men.

Rosie is an apt name for this particular project because Rosie the Riveter operated a riveting machine in a factory. The work was critical to the production of planes, which were in turn critical for the war effort. Rosie the pattern matching technology may also be a step or two removed from the most visible parts of a project. Rosie will tirelessly mine and structure the data, providing the input to the next step of processing. That step or maybe the next one after it will no doubt produce useful and even critical results. But Rosie was there at the start, making the rest possible.

Do you give talks at universities?

I am happy to give talks at universities and conferences. And for a small fee, I will speak at weddings and birthday parties. 😉

I found a mistake in your slides

Yikes! The process sub-match output on slide 13 (from my Tech Talk on 30 November 2016) is indeed incorrect. I have fixed this (for future versions). Good catch, and thanks for letting me know!

5 comments on"Rosie Pattern Language Q&A"

  1. I didn’t get a chance to ask the question earlier:

    Does Rosie currently support Unicode? A number of regex engines have good Unicode support and Rosie would need that to be used for our purposes.

    • As of the current version (0.99i), Rosie supports Unicode in the most basic way (though full UTF-8 support is coming very soon). You should be able to see by the transcript below that the dot “.” matches a single unicode character, and that characters from the full unicode range are supported in literal strings and of course in the input data.

      Planned support includes only UTF-8 at this time, but I’d like to know if other encodings are desired.
      Two additional stages of support are planned to be delivered for Rosie Version 1.0:

      1. Support for unicode characters to be specified using their codepoint within strings, e.g. “\u0041”.
      2. Character classes based on unicode character properties.

      jjennings$ cat hello
      ¡Hola! 😀
      jjennings$ rosie ‘.{6,6} “😀”‘ /tmp/hello
      ¡Hola! 😀
      jjennings$

  2. Rich Jolissaint March 03, 2017

    Hello Ms Jennings,
    I was looking at LPEG to process log files, and then found Rosie Pattern Language. Thank you for releasing your project to the world!
    We would be working in Lua, specifically LuaJIT in the openresty bundling of nginx.
    Are the underlying LPEG tables exposed to users of Rosie? If not, do you plan to expose them?
    Thank you,
    Rich Jolissaint

    • Hello! Rosie does not currently expose the LPEG patterns, but I plan to do this for Lua users. (I have a “rosie.lua” module in a development branch that loads Rosie, and soon there will be a function in this module that returns the LPEG object.)

      For now, I would be happy to tell you how to access the LPEG objects. I am very interested in this idea because Rosie is designed to allow separation of the compiler from the run-time (matcher) — but today they are together.

      If you would be so kind as to open an issue or question on GitHub, we can continue the discussion there?

      Jamie

Join The Discussion

Your email address will not be published. Required fields are marked *