Think of your favorite regex tool. How flexible is it when it comes to producing output?

If your reflex is to reach for Perl, Python, or Ruby, you have infinite possibilities at hand. All you need to do is write some code. Some scaffolding to read the input, apply the pattern matching, and generate output in any format you like. It’s a simple matter of programming.
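To make that scaffolding concrete, here is a minimal Python sketch. The names `find_matches` and `report`, the simplistic dotted-quad pattern, and the tab-separated output format are all assumptions picked for illustration, not anything prescribed by a particular tool.

```python
import re
import sys

# Hypothetical pattern for illustration: a simplistic dotted-quad matcher.
IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

def find_matches(lines, pattern=IPV4):
    """Yield (line_number, matched_text) for every match in the input lines."""
    for number, line in enumerate(lines, start=1):
        for match in pattern.finditer(line):
            yield number, match.group(0)

def report(lines, out=sys.stdout):
    """Write one tab-separated record per match; swap this for any format you like."""
    for number, text in find_matches(lines):
        out.write(f"{number}\t{text}\n")
```

Calling `report(sys.stdin)` at the end of a script turns this into a filter. Changing the output format means rewriting `report`, which is exactly the "simple matter of programming" in question.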

If you reach for grep, you get pretty much just matching lines or matching text as output. You can get color-highlighted matches without any coding, but only in one color. Of course, you can always build a little pipeline in which each stage has an expression and a color, like this (with line breaks for clarity):

tail /var/log/system.log |
   GREP_COLOR='35' egrep --color=always 'apple|$' |
   GREP_COLOR='34' egrep -i --color=always 'syslogd|$' |
   GREP_COLOR='36' egrep --color=always 'launchd|$'

Simple, right? It’s just a matter of consulting the man pages, learning the color codes, splitting up your regex, and doing some extra typing. There is the small matter of making sure that your regex patterns are disjoint. In the example above, the expressions are literals (apple, syslogd, and launchd) that are clearly disjoint.

In other cases, it’s not so obvious. For example, the command below looks like it should print IPv4 addresses in red (31) and domain names in yellow (33):

cat ~/Projects/rosie-pattern-language/test/resolv.conf | 
   GREP_COLOR='31' grep -E --color=always '(([0-9]{1,3})([.][0-9]{1,3}){3})|$' |
   GREP_COLOR='33' grep -E --color=always '(\w+([.]\w+)+)|$' 

But it fails, because the overly general expression for domain names will also match IPv4 addresses, leading to incorrect output. (Note: I did not have the patience to add a third stage to the pipeline to match and print IPv6 addresses. The example is messy enough as it is, and it’s simpler than many real-world use cases!)
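One way to see the overlap is to run both expressions against a few sample strings. The sketch below uses Python's `re` module, in which the two expressions from the pipeline are valid as written. The helper name `overlapping` and the sample values are made up for illustration.

```python
import re

# The two expressions from the grep pipeline, without the trailing '|$'.
IPV4 = re.compile(r"([0-9]{1,3})([.][0-9]{1,3}){3}")
DOMAIN = re.compile(r"\w+([.]\w+)+")

def overlapping(samples, first=IPV4, second=DOMAIN):
    """Return the samples that both expressions match in full."""
    return [s for s in samples if first.fullmatch(s) and second.fullmatch(s)]

# Made-up sample values, not taken from the resolv.conf used in the post.
samples = ["10.0.0.1", "example.com", "192.168.1.254"]
print(overlapping(samples))
```

Both addresses come back, because digits count as word characters for `\w`. A check like this can only turn up counterexamples; an empty result does not prove that two expressions are disjoint.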

Rosie outputs plain text, highlighted text, and JSON

The screenshot below shows some of the Rosie output formats in action. In the color output, the default colors are: net.fqdn in yellow, net.ipv4 in red, and net.ipv6 in magenta. Note that the pattern we use on the command line is net.any — a pattern defined in terms of the more specific patterns net.fqdn, net.ipv4, and net.ipv6.

The screen capture starts with Rosie producing plain text output, like grep. Next, with a command line option, each named pattern (like net.ipv4 and net.ipv6) is printed in a different color. Finally, a different option causes the matches to be printed in JSON structures.


Rosie was created to simplify pattern matching

That includes making it easy to visually confirm that you’re getting the matches you expect. Rosie has a default set of colors assigned to many of the patterns in its standard library, and you can customize those colors as you wish.

Solutions using grep/egrep are clumsy and error-prone. First, writing correct regex on the fly is hard. It’s better to have a library of named patterns for common syntactic entities like network addresses, timestamps, etc. Second, it’s hard to prove at a glance that a set of regular expressions is disjoint, which makes it hard to write a correct pipeline that highlights each pattern with a different color.

Last, I find that color output can be a great help when debugging a larger solution (program or command line) that includes a pattern matching step. That is, I often want to run just the pattern matching step on some sample data and look at the output. With color match highlighting, I can see at a glance whether I’m getting the matches I want. But using color for debugging is pretty hard to do if building the command line pipeline itself is error-prone.

How Rosie produces color output

Rosie comes with a library of pre-defined patterns, and you can add to the library. For the current release, Rosie v0.99, see the rpl directory in the master branch. We are building up an equivalent library, typically using shorter pattern names, in the work leading up to Rosie v1.0. The first of the Rosie v1.0 patterns can be seen in the rpl directory of the v1-tranche-3 branch.

With a pattern library, patterns have names. So, Rosie can map names to colors. The -o color command line option causes Rosie to look up the name of each match (e.g. net.ipv4, net.fqdn) to see if a color is defined for it. There are default colors for many of the patterns in the standard library, and it is easy to modify them.
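The general idea of looking up a color by match name can be sketched in a few lines of Python. To be clear, this is not how Rosie is implemented. It stands in a single regex alternation with named groups for Rosie's pattern library, and the `COLORS` table, the `colorize` function, and the two simplistic patterns are all hypothetical. Python group names cannot contain dots, so `net.ipv4` and `net.fqdn` are shortened to `ipv4` and `fqdn`.

```python
import re

# Assumed name-to-color table (ANSI color codes). Red and yellow follow the
# defaults mentioned in this post; the table itself is hypothetical.
COLORS = {"ipv4": "31", "fqdn": "33"}

# One alternation with a named group per pattern. Earlier alternatives win,
# so the specific ipv4 pattern is tried before the more general fqdn one.
NET_ANY = re.compile(
    r"(?P<ipv4>\b\d{1,3}(?:\.\d{1,3}){3}\b)"
    r"|(?P<fqdn>\b\w+(?:\.\w+)+\b)"
)

def colorize(line, pattern=NET_ANY, colors=COLORS):
    """Wrap each match in the ANSI color assigned to the group that matched."""
    def paint(match):
        code = colors.get(match.lastgroup)  # name of the alternative that matched
        if code is None:
            return match.group(0)           # no color defined: leave text as is
        return f"\x1b[{code}m{match.group(0)}\x1b[0m"
    return pattern.sub(paint, line)
```

Because every match carries a name, a single pass over the input is enough, and changing a color means editing one table entry rather than restructuring a pipeline.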

It is also easy to assign colors to your own custom patterns, or to patterns you load from a third-party library. But that’s a topic for a future post.

Beyond color highlighting

Color highlighting is for humans. When we write scripts, and when we are data mining, the output of pattern matching is the input to another program. Rosie can generate plain text output, like grep, which makes it easy to integrate Rosie into existing scripts.

But Rosie can also produce JSON. The structure of the JSON reflects the structure of the match. If a pattern has sub-patterns, a match for that pattern will have sub-matches. This structure is actually a parse tree, but that’s another topic for another future post.
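To illustrate what "sub-matches" means in structured output, here is a small sketch using Python's `re` and `json` modules. It does not reproduce Rosie's JSON schema: the field names (`type`, `text`, `start`, `end`, `subs`), the `ENDPOINT` pattern, and the `match_to_tree` helper are invented for this example.

```python
import json
import re

# Hypothetical pattern with two named sub-patterns: a host and a port.
ENDPOINT = re.compile(r"(?P<host>\w+(?:\.\w+)+):(?P<port>\d+)")

def match_to_tree(name, match):
    """Build a nested dict: the whole match plus one sub-match per named group."""
    subs = [
        {"type": group, "text": text,
         "start": match.start(group), "end": match.end(group)}
        for group, text in match.groupdict().items()
        if text is not None  # skip groups that did not take part in the match
    ]
    return {"type": name, "text": match.group(0),
            "start": match.start(), "end": match.end(), "subs": subs}

m = ENDPOINT.search("connect to example.com:8080 now")
print(json.dumps(match_to_tree("endpoint", m), indent=2))
```

A downstream program can then pick out the host or the port by name instead of re-parsing the matched text.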


Output formats like color highlighting and JSON are how #modernpatternmatching is done.


Follow us on Twitter for announcements. We expect v1.0.0 to be released later this summer.