This is the second of a three-part blog series about developing Rosie Pattern Language (RPL) patterns. Part 1 illustrated RPL patterns for parsing application log files; Part 3 will demonstrate how to automatically generate RPL patterns.

Update on 14 Nov, 2016: The command to launch Rosie has changed to bin/rosie as of November 6, 2016, with the release of version v0.99g. The examples in this blog show the old command, ./run, which is deprecated.

The Rosie Pattern Language (RPL) provides a unique way to quickly develop custom patterns for matching your data. This blog post shows you how to develop RPL patterns for “comma separated value” (CSV) files, a data format that is very common, despite the fact that it is not standard and that it comes in many flavors.

Why use Rosie Pattern Language for this task?

There are three good reasons to use Rosie to parse CSV data:

  1. The CSV data is embedded with data of other formats, perhaps appearing within an application log file or an email. (The same reasoning suggests using Rosie to parse JSON in the same situations.)
  2. You want control over the parsing: for example, you want to support double-quoted fields but not single-quoted fields in your application, or you want the ability to extend the parser if you encounter a new flavor of CSV.
  3. You are already using Rosie, so parsing CSV data with Rosie means having just one data parsing tool instead of many.

Let’s look at some data

We’ll use some short data files as we develop a robust CSV parser, which will eventually become the csv.rpl file in the Rosie Pattern Language repository.

Our sample data set consists of three short files. The first, sample_comma.csv, has six rows of comma-separated data. The contents of that file are:

2011-11-11,X12334,customerdetails
Bob,Dyson,true,23,"23,456",0,"1,543,543.36","""Interstate 80 Revisted"""
Jimmy,Pace,
Saul,McCartney,false,67,"89,125","4,324,307.5","986,545.12"
Elvis,John,
Eric,Clayton,true,461,"33,101",1.11,100.01,"\"Greatest Hits\""

Note that the rows do not have equal numbers of fields. Also, some rows end with a comma (meaning that there is an additional empty field at the end of that row). Some fields are quoted, while others are not. Finally, we can see two different ways of including quotation marks within a quoted field: the second line uses a pair of double quote characters, and the last line uses the sequence backslash quote.

In general, comma-separated-value data can be formatted in all these ways and more. (See, for example, this Wikipedia article for more information.) In addition, we often see data labeled as “csv” but with semi-colon or pipe characters as the field delimiter. The other files of sample data in the Rosie repository use those delimiters.

Parsing unquoted fields

An unquoted field cannot contain a comma, since that character separates one field from the next. As a result, the pattern to match an unquoted field is easy. Let’s name it uqf for unquoted field:

uqf = {!"," .}*

The pattern reads as follows: while not looking at a comma, match any character; repeat zero or more times. This is a common idiom in RPL, used whenever we want to match all characters up to (but not including) a terminator. Note that the terminator does not have to be a character. It can be a string or a pattern. For example, to match all characters up to the words “End transcript”, we could write:

example1 = {!"End transcript" .}*

Or, to match all characters up to a network address, discard those characters, and then capture the network address, we could write:

find_addr = {!basic.network_patterns .}* basic.network_patterns

We could use the Rosie read-eval-print loop to define and test such a pattern:

jennings$ ./rosie -repl
This is Rosie v0.99e
Rosie> find_addr = {!basic.network_patterns .}* basic.network_patterns
Rosie> .match find_addr "Hello world!  Let's see what's new at http://developer.ibm.com!"
{"find_addr": 
   {"subs": 
      [{"basic.network_patterns": 
         {"subs": 
            [{"network.url": 
               {"subs": 
                  [{"network.protocol": 
                     {"text": "http:\/\/", 
                      "pos": 39.0}}, 
                   {"network.fqdn": 
                     {"text": "developer.ibm.com", 
                      "pos": 46.0}}], 
                "text": "http:\/\/developer.ibm.com", 
                "pos": 39.0}}], 
          "text": "http:\/\/developer.ibm.com", 
          "pos": 39.0}}], 
    "text": "Hello world!  Let's see what's new at ht...", 
    "pos": 1.0}}
Warning: 1 unmatched characters at end of input
Rosie> 

Returning from this digression, let’s test our pattern for matching unquoted fields in CSV files:

Rosie> uqf = {!"," .}*
Rosie> .match uqf "The quick brown fox, the lazy dog, and a giant rabbit"
{"uqf": 
   {"text": "The quick brown fox", 
    "pos": 1.0}}
Warning: 34 unmatched characters at end of input
Rosie> 

If we had only unquoted fields, we could finish our work by using uqf in a repeating pattern, such as:

Rosie> row = uqf {"," uqf}*
Rosie> .match row "The quick brown fox, the lazy dog, and a giant rabbit"
{"row": 
   {"subs": 
      [{"uqf": 
         {"text": "The quick brown fox", 
          "pos": 1.0}}, 
       {"uqf": 
         {"text": " the lazy dog", 
          "pos": 21.0}}, 
       {"uqf": 
         {"text": " and a giant rabbit", 
          "pos": 35.0}}], 
    "text": "The quick brown fox, the lazy dog, and a...", 
    "pos": 1.0}}
Rosie> 

Alas, we will surely encounter quoted fields in CSV data, so let’s define a pattern for those.

Parsing quoted fields

Starting simple

For now, we will consider quoted fields that do not contain embedded quotation marks. The pattern we want is one that matches a quotation mark, followed by anything that is not a quotation mark, followed by a quotation mark. This pattern is so simple that we will define it three different ways, in order to illustrate various RPL features. We will continue to use the Rosie read-eval-print loop to develop our patterns interactively:

Rosie> qf = "\"" {!"\"" .}* "\""
Rosie> .match qf "\"foo, bar, baz\""
{"qf": 
   {"text": "\"foo, bar, baz\"", 
    "pos": 1.0}}
Rosie>

In the transcript above, we have defined qf to match a field delimited by double quotes. Because string literals in RPL are written using double quotes, we must escape the double quote character using a backslash. As a result, "\"" is how we write a one-character string containing a double quote.

Strings containing escape characters can appear strange. A better notation for a single character is to use the RPL character set syntax, which is very similar to the regular expression syntax and uses square brackets:

Rosie> qf = ["] {!["] .}* ["]
Warning: reassignment to identifier qf
Rosie> .match qf "\"foo, bar, baz\""
{"qf": 
   {"text": "\"foo, bar, baz\"", 
    "pos": 1.0}}
Rosie>

This looks a little better. The middle section could be simplified, however. The RPL character set syntax supports set complement, so we could write [^"] to match any character that is not a double quote:

Rosie> qf = ["] [^"]* ["]
Warning: reassignment to identifier qf
Rosie> .match qf "\"foo, bar, baz\""
{"qf": 
   {"text": "\"foo, bar, baz\"", 
    "pos": 1.0}}
Rosie>

Adding support for literal quotation marks

Next we add support for double quote marks embedded inside quoted fields. We’ll start with the quote-quote syntax, as shown in this data row, where the last field is a quoted field that itself contains a double quote as its first and last characters:

Bob,Dyson,true,23,"23,456",0,"1,543,543.36","""Interstate 80 Revisted"""

A good way to think about escape characters is to consider that each character in a text field is either an escaped character or a normal character. Since the dot (that is, .) will match any character, we can write a pattern that matches any of the allowable escape character sequences, or (if that fails) any single character.

In the following transcript, recall that the Rosie command .match requires a pattern and a quoted input string to match against. To put double quote characters within the input string, we must escape them using the RPL syntax, which requires a backslash.

Note: In this transcript, I turn off the debugging output for brevity.

Rosie> .debug off
Debug is off
Rosie> escq = "\"\""
Rosie> .match escq ""
Repl: No match  (turn debug on to show the match evaluation trace)
Rosie> .match escq "\""
Repl: No match  (turn debug on to show the match evaluation trace)
Rosie> .match escq "\"\""
{"escq": 
   {"text": "\"\"", 
    "pos": 1.0}}
Rosie> .match escq "\"\" and more"
{"escq": 
   {"text": "\"\"", 
    "pos": 1.0}}
Warning: 9 unmatched characters at end of input
Rosie> 

We define escq (above) as a literal string of two double quote marks. For illustration, we first show that escq will not match the empty string ("") or a single quote ("\""). As desired, it does match two quote marks ("\"\"").

The contents of a quoted field, then, is a pattern that we will call c which matches either an escaped quote mark or any single character, repeated zero or more times:

Rosie> c = {escq / .}*
Rosie> .match c "\"\" and more"
{"c": 
   {"subs": 
      [{"escq": 
         {"text": "\"\"", 
          "pos": 1.0}}], 
    "text": "\"\" and more", 
    "pos": 1.0}}
Rosie> 

In the test of c above, we see that escq appears as a sub-match of the pattern c. When we later process the results of the match, we will refer to the field of c called text, and we will ignore the escq sub-match because it adds no additional information. We can tell Rosie to omit this sub-match by changing the definition of escq to be an alias:

Rosie> alias escq = "\"\""
Warning: reassignment to identifier escq
Rosie> c = {escq / .}*
Warning: reassignment to identifier c
Rosie> .match c "\"\" and more"
{"c": 
   {"text": "\"\" and more", 
    "pos": 1.0}}
Rosie> .match c "\"\" and more marks \"\" in the middle"
{"c": 
   {"text": "\"\" and more marks \"\" in the middle", 
    "pos": 1.0}}
Rosie> 

At the end of the transcript above, we demonstrated that the pattern c will match characters before, after, and including pairs of quote marks, that is, exactly the contents of a quoted field. But this is not exactly what we want.

When does the pattern c stop consuming input? Never! It will greedily consume all characters until the end of the input, including the quote mark that will end the quoted field, and all the fields after it!

-- This naive definition of qf WILL NOT WORK
Rosie> qf = ["] c ["]
Warning: reassignment to identifier qf
Rosie> .match qf "\"foo,bar,   baz\""
Repl: No match  (turn debug on to show the match evaluation trace)
Rosie> 

When we say we want to pattern c to match the contents of the quoted field, we mean that it should consume an escaped quote or anything except a double quote (repeatedly). Then we can define qf, a pattern for the entire quoted field, including the quote marks themselves:

Rosie> c = {escq / [^"]}*
Warning: reassignment to identifier c
Rosie> qf = ["] c ["]
Warning: reassignment to identifier qf
Rosie> .match qf "\"foo,bar,   baz\""
{"qf": 
   {"subs": 
      [{"c": 
         {"text": "foo,bar,   baz", 
          "pos": 2.0}}], 
    "text": "\"foo,bar,   baz\"", 
    "pos": 1.0}}
Rosie> .match qf "\"\"\" and more marks \"\" in the middle\""
{"qf": 
   {"subs": 
      [{"c": 
         {"text": "\"\" and more marks \"\" in the middle", 
          "pos": 2.0}}], 
    "text": "\"\"\" and more marks \"\" in the middle\"", 
    "pos": 1.0}}
Rosie> 

Putting it all together

Assembling the patterns we have written already, we find we need two more definitions: one to match a field (which may be quoted or unquoted), and one to match an entire row. Those are straightforward to write. The result, which can be found in the Rosie repository, looks like this:

uqf = [^,]*
alias escq = "\"\""         -- two double quote marks are one way to escape a double quote character
c = {escq / [^"]}*          -- contents of a quoted field
qf = ["] c ["]              -- the actual quoted field, surrounded by a double quotes

field = qf / uqf
row = field ([,] field)*

The RPL code above is part of the Rosie distribution, and so we can load it all at once and do some interactive tests with the read-eval-print loop:

Rosie> .load $sys/rpl/csv1.rpl
Warning: reassignment to identifier uqf
Warning: reassignment to identifier escq
Warning: reassignment to identifier c
Warning: reassignment to identifier qf
Loaded /Users/jjennings/Dev/public/rosie-pattern-language/rpl/csv1.rpl
Rosie> .match row "Last name, first name, false, 123, \"this, field, has, commas and \"\"more\"\"\""
{"row": 
   {"subs": 
      [{"field": 
         {"subs": 
            [{"uqf": 
               {"pos": 1.0, 
                "text": "Last name"}}], 
          "pos": 1.0, 
          "text": "Last name"}}, 
       {"field": 
         {"subs": 
            [{"uqf": 
               {"pos": 12.0, 
                "text": "first name"}}], 
          "pos": 12.0, 
          "text": "first name"}}, 
       {"field": 
         {"subs": 
            [{"uqf": 
               {"pos": 24.0, 
                "text": "false"}}], 
          "pos": 24.0, 
          "text": "false"}}, 
       {"field": 
         {"subs": 
            [{"uqf": 
               {"pos": 31.0, 
                "text": "123"}}], 
          "pos": 31.0, 
          "text": "123"}}, 
       {"field": 
         {"subs": 
            [{"qf": 
               {"subs": 
                  [{"c": 
                     {"pos": 37.0, 
                      "text": "this, field, has, commas and \"\"more\"\""}}], 
                "pos": 36.0, 
                "text": "\"this, field, has, commas and \"\"more\"\"\""}}], 
          "pos": 36.0, 
          "text": "\"this, field, has, commas and \"\"more\"\"\""}}], 
    "pos": 1.0, 
    "text": "Last name, first name, false, 123, \"this..."}}
Rosie> 

Let’s try the pattern called row against each line of our sample data file:

jennings$ ./rosie -f '$sys/rpl/csv1.rpl' row doc/data/csv/sample_comma.csv 
2011-11-11 X12334 customerdetails 
Bob Dyson true 23 23,456 0 1,543,543.36 ""Interstate 80 Revisted"" 
Jimmy Pace  
Saul McCartney false 67 89,125 4,324,307.5 986,545.12 
Elvis John  
Eric Clayton true 461 33,101 1.11 100.01 \ 
jennings$ 

The last line of the Rosie output (above) is truncated because that line of the data file uses a different way of escaping a double quote: preceding it with a backslash. The exercise of adding this (or other) escape sequences is left to the reader. You can see one way of doing this in the complete csv.rpl file in the Rosie repository.

The file csv.rpl is listed in the Rosie manifest file, and so it is loaded automatically. We can see that the code in this file parses our sample file fully (below). Note the use of the pattern name csv.comma here, and that the rpl file also defines csv.semicolon and csv.pipe to parse files that use those delimiters.

jennings$ ./rosie csv.comma doc/data/csv/sample_comma.csv 
2011-11-11 X12334 customerdetails 
Bob Dyson true 23 23,456 0 1,543,543.36 ""Interstate 80 Revisted"" 
Jimmy Pace  
Saul McCartney false 67 89,125 4,324,307.5 986,545.12 
Elvis John  
Eric Clayton true 461 33,101 1.11 100.01 \"Greatest Hits\" 
jennings$ 

The JSON version of the same output can be found in the Rosie repository, where it has been formatted by a pretty printer. In the JSON output, you can observe that the quoted fields are properly parsed in their entirety, including embedded (escaped) quotation marks.

Conclusion

We have built up a compact set of patterns that will parse CSV data, along the way exploring various features of RPL (the language) and the Rosie Pattern Engine (the implementation, including the read-eval-print loop).

The RPL code we wrote is in the Rosie repository in csv1.rpl, and a generalization of that code is in csv.rpl. That general version is loaded automatically when Rosie starts, and will match CSV data that:

  • is delimited by commas, semi-colons, or pipe characters;
  • has quoted and unquoted fields;
  • uses single or double quotes for quoted fields; and
  • contains any of 3 different ways of escaping the quote character.

A note about a forthcoming RPL feature

A skilled programmer will observe that the final csv.rpl file contains some definitions that follow a pattern, such as the way unquoted field patterns are defined once for each delimiter:

csv.uqf_comma = [^,]*
csv.uqf_semicolon = [^;]*
csv.uqf_pipe = [^|]*

This is a situation that calls for a kind of function definition facility, and such a facility is being explored for RPL. The syntax has not been finalized, but in pseudo-code, one might imagine being able to write:

-- Warning: This capability is not yet implemented, so this won't work!
csv.uqf_comma = read_until(",")

Of course, one must separately define read_until as a function of one argument that returns a pattern which, when used, will read and return all input up to (but not including) the function argument. In the example above, the argument is ",", and the pattern that should be returned by read_until might be [^,]*.

If you are interested in how functions (and possibly macros) can enhance RPL, please leave a comment here or open an issue in the Rosie Pattern Language repository. This is an area of significant evolution for Rosie, and we welcome your comments and suggestions.

What’s next?

In the next blog post, we’ll explore how to automate the pattern-writing process by generating patterns from sample data.

Join The Discussion

Your email address will not be published. Required fields are marked *