Isn’t regex technology good enough?

Short answer: In a lot of cases (mostly in small applications), yes, regex [1] is good enough. But for textual pattern matching in the large, meaning when there are lots of patterns, lots of coders and users, or lots of data, regex-based solutions can become fragile, unmaintainable, and unexpectedly slow. The Rosie Pattern Language project aims to provide a better foundation than regex for pattern matching at scale:
  • Expressions in Rosie Pattern Language (“rosex”, perhaps?) look like expressions in conventional programming languages. Whitespace, comments, and clear delineation between operators and literals make it easy to read, maintain, and process with automated tools (like diff).
  • The (forthcoming) module system (lexically scoped namespaces) and macros make it dramatically easier to share pattern packages and to write even more readable patterns.
  • The formalism underlying Rosie is the Parsing Expression Grammar, which can do everything a regular expression can do, and more. A PEG can be recursive, while a regular expression cannot. This allows PEGs (and Rosie expressions) to recognize recursively defined data structures such as JSON, XML, and s-expressions.
  • The algorithm used by Rosie for pattern matching is a linear time algorithm, meaning that it requires time proportional to the size of the input string.
Now, in my opinion, regex are great for small tasks. But as I said in my talk at All Things Open 2016, most modern regex engines have surprisingly variable performance. In fact, in my slides (see page 13), I gave an example of a regex that compares the 30th value in a line of CSV data to a fixed string, “Gold”. On a 94-character input string (see below), Perl takes around a minute to declare no match. Python takes over a minute and a half! Later in this post, I’m going to unpack why this is the case, so that I can show an alternative regex that doesn’t have the same problem, and also how that alternative is virtually identical to its Rosie equivalent.

Wait, is this example contrived?!

Short answer: both yes and no. The regex in question, ^(.*?,){29}Gold, is a pattern that causes exponential backtracking in most modern regex engines — most, but not all! I’m aware of two exceptions: grep and RE2. Those two engines implement versions of a linear time algorithm that has been known since the 1960’s. If you are using grep or RE2, then my regex example will not give you performance problems. But the example is not contrived, for two main reasons:
  1. Most programming environments provide a regex engine that will backtrack exponentially; and
  2. regex found “in the wild” look a lot like my example, crafted for ease of writing, because the performance implications are not obvious.
Russ Cox implemented RE2 for Google Code Search in order to avoid the backtracking done by virtually all other regex implementations. In a superb blog post entitled Regular Expression Matching in the Wild, Cox explains that Code Search would be subject to denial of service attacks if it used a backtracking algorithm (not to mention the less critical problem of a poor user experience when a particularly slow pattern is entered). Now, Code Search was an application that accepted regex from anyone using the service. Perhaps your regex-based application (not using RE2) uses only a carefully curated collection of patterns. In that case, you may not be concerned about backtracking. That is, as long as your regex writers are expert enough to know which patterns could trigger backtracking. And as long as you don’t inadvertently “import” regex that were not so expertly crafted. But if you are processing big data, or if you allow users to enter their own regex when using your application, I personally recommend changing to an engine like RE2. Though you may also consider a regex alternative like Rosie Pattern Language! Let’s take a close look at why backtracking occurs, how to avoid it in regex, and why this isn’t a problem with Rosie.

How exactly does this excessive backtracking happen?

Short answer: slowly. Here is a Python version of the Perl example from my talk, taken from a post by Marc-Arthur Pierre-Louis:
t = "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,Bronze,Bronze,Gold,Silver"
m = re.match("^(.*?,){29}Gold",t);
The function re.match is called with an expression that skips the first 29 fields in the CSV string t and compares the next 4 characters to Gold. It will fail because the 30th field is “Silver”. In the regex, .*? is a lazy version of .*, meaning that it will match the shortest possible sequence instead of the longest. [2] The .*?, initially matches “1,”, and repeating this 29 times, the engine is left trying to match the remainder of the pattern, Gold against the input “Silver”. This fails, of course, and so the engine backtracks because it had a choice of how much each .*?, would match. The 29th iteration of .*? matched “Gold,” and it cannot match a longer substring (because there are no more commas). But the 28th iteration matched “Bronze,” and it could have matched more. If the 28th iteration had matched “Bronze,Gold,” instead, would the pattern as a whole succeed? No, and there are no other possibilities for the 28th iteration. So, the engine goes back to the 27th iteration, which initially matched “Bronze,” and tries instead “Bronze,Bronze,” and then “Bronze,Bronze,Gold,”, only to find that neither succeeds. And here’s the key to understanding why the slowdown is exponential: for each candidate match of the 27th iteration, there are multiple possible matches for the 28th iteration. You can see where this is going. And when you look carefully, you see that each character of the input string has to be examined a great number of times during this process! The recurrence equation describing this behavior shows that the total number of steps required is proportional to 2^n (2 to the n power) where n is the length of the input string. That’s why it takes so long to match such a short pattern against such a short input string.

Can I avoid backtracking altogether?

Short answer: Nope. Even Rosie backtracks sometimes. For example, all pattern matching technologies that have an alternation operator will backtrack. The vertical bar operator (|) in regex and the slash operator (/) in RPL express alternatives: if the expression before the operator fails, then the one after the operator is tried. A key difference between RPL and regex is that in RPL, backtracking is bounded. The alternation operator in RPL is called ordered choice because once an alternative succeeds, the matching engine moves forward and will never return to reconsider the alternatives. Similarly, the repetition operators in RPL are greedy, consuming as much input as they can, and the matching engine will not backtrack to a repetition operation to see what would happen if it had consumed less. Regex alternation and repetition operators cause (most) regex engines to backtrack a lot — in the worst case, exponentially much, as we have just seen.

Ok, so how do I avoid exponential backtracking?

Short answer: switch to RE2 or write your regex differently. I’ll show now that writing the regex differently is equivalent to writing it in Rosie instead. We have to avoid .*? (and the related .*) because its ambiguity is the source of the problem. We need a pattern that matches each field in our CSV string by explicitly looking for the comma at the end of each field. There are many ways to express this, but one way is to use the complemented character set [^,] which matches any character that is not a comma. Our regex example, rewritten, becomes: m = re.match("^([^,]*,){29}Gold",t); The above line of Python code executes in a tiny fraction of a second (and should be fast in every other regex engine). The equivalent to this in Rosie is almost identical (white space added for readability), and also takes a tiny fraction of a second to run: ([^,]* [,]){29,29} "Gold" If our input data were in a file like example.csv, then we can use this pattern with Rosie as follows:
bash-3.2$ rosie -manifest - '([^,]* [,]){29,29} "Gold"' ./example.csv
bash-3.2$ rosie -manifest - '([^,]* [,]){29,29} "Silver"' ./example.csv
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,Bronze,Bronze,Gold,Silver 
bash-3.2$ 
In the example above, we didn’t need any of the 200+ built-in Rosie patterns, and so we used the option -manifest - to tell Rosie not to load and compile them. [3] The first use of Rosie, looking for “Gold”, does not find a line matching that pattern, and so there is no output. The second invocation, looking for “Silver”, finds a match and illustrates that the pattern is working as expected. Two questions arise:
  1. If it’s possible to write an expression that avoids the exponential backtracking, then why should I care about this example?
  2. Hey, what if I write the equivalent to the slow regex (which uses .*) in Rosie? Won’t that cause the same performance problem?
For the first question, we can examine some of the online regex guides, best practices, and posted questions. We find that .* is used frequently in nested ways (including repetitions), thus opening the door to performance anomalies. Regex is a language in which it is easy to write unexpectedly poor performing expressions. The inclusion of an exponential backtracking * operator in a regex implementation makes it unsuitable for big data — and this describes virtually every regex engine in wide use. And as to the second question, Rosie’s * operator is greedy and does not backtrack. Thus, .* in Rosie will consume the rest of the input. Indeed, the expression .* is useful only for that purpose. As you may recall, anything that can be expressed with a regular expression can be expressed with a parsing expression grammar. So you can write an equivalent in Rosie to the backtracking behavior of regex, but it takes some work to do so. And why would you want to? It would be like deliberately coding a slow loop in a conventional programming language when there’s a faster alternative. Rosie encourages the writing of high performance patterns, in part because all of Rosie’s operators have performant implementations, and in part because it takes extra effort to write a super slow pattern.

Can we wrap this up now?

Short answer: yep! Among the reasons I created Rosie Pattern Language, its formal basis in Parsing Expression Grammars, and its linear time matching algorithm [4] were important considerations. In practice (as well as in theory), Rosie exhibits predictably good performance. [5] But reliably good performance was only one consideration in the design. As I have explored here in this post, there are avoidable performance anomalies with the regex engines in wide use today, notably excepting RE2. If every scalable regex-based pattern matching solution were based on RE2, however, we would still want Rosie, for the following reasons:
  • Parsing Expression Grammars are more expressive (powerful) than regex, and can recognize recursively defined data formats.
  • The syntax and semantics of Rosie Pattern Language have a lot of overlap with regex, making it easy to learn.
  • The forthcoming (at the time of this writing, November 2016) module system allows collections of patterns to be easily shared openly/globally and of course privately across teams.
  • RPL is language independent. You don’t need to know Python or Ruby or Perl or Go or Swift in order to compose patterns and invoke matching.
  • RPL has many attributes of generic programming languages, making it easier than regex to read and maintain, and enabling a variety of programming tools to be applied to patterns (e.g. macros, source code management, static analysis).
Comments on this post are welcome, as well as bug reports and other issues on Rosie Pattern Language.

Notes

[1] By regex, I mean the modern notion of regular expressions, which are patterns written in a language based on the theory of regular languages, but which typically go beyond the theory to include non-regular extensions such as backreferences. [2] The greedy version can also be used here. In that case, the entire expression is "^(.*,){29}Gold". (Importantly, whether the operator is greedy or not will not affect whether the match succeeds or not; it only affects the captured text.) Backtracking also occurs with the greedy expression. However, it does less backtracking than the lazy version, and yet the lazy version is the one we find more often “in the wild”. Here’s a puzzle to test your regex expertise: Why is this the case? [3] When invoked at the command line, Rosie loads all the RPL files listed in a MANIFEST file. The loading and compiling of hundreds of patterns is a process which has not been optimized yet. The cost of this process at start up is around 0.25 seconds at present (Rosie v0.99h). For big data sets and for streaming, this cost is immaterial. For other uses of Rosie, a reduction in start up time would be useful and should be possible, e.g. by pre-compiling patterns and persisting them to storage. In addition, the RPL compiler has not been optimized at all, and we have not yet tried the well-regarded Lua JIT platform as an alternative to interpreted Lua. [4] The PEG matching engine, called lpeg, that Rosie uses was designed and implemented by Professor Roberto Ierusalimschy. The engine is implemented in C and uses a beautiful parsing virtual machine design that is efficient in both processing time and memory. [5] For example, on application log files, Rosie out performs grok by a factor of around 4 (see page 29 of this slide set).

1 comment on"Rosie and regex: questioning anomalous performance"

  1. […] A pattern matching operator is called possessive if the matching engine will not backtrack to consider a different match. Regex operators are usually not possessive. That is, regex engines will backtrack (when the overall pattern fails) to each repetition operator to see if it can match a different number of instances instead. This can cause unexpectedly poor performance. See Rosie and regex: questioning anomalous performance. […]

Join The Discussion

Your email address will not be published. Required fields are marked *