It’s easy to make a mistake when entering a regular expression on the command line. Sometimes we also make a hard-to-spot error in a regular expression that is part of a program. Those errors are usually not caught at compile time, even though that is where we want to catch as many errors as we can.

When a regex appears as a string in a Java program, as it might in other languages, the Java compiler doesn’t understand that the string is a regex, and can’t check for mistakes that will throw run-time exceptions later.

One analysis found 56 regex syntax errors in an examination of five open source Java projects.[1] These projects would compile and run, but eventually, when the offending regex was used, it would throw a PatternSyntaxException or an IndexOutOfBoundsException.
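As a minimal sketch of the two failure modes named above (the class name, method names, and sample regexes are made up for illustration, not taken from the cited study), the following Java program compiles cleanly, yet each regex misbehaves only when it is used:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexRuntimeErrors {
    // Returns true if Pattern.compile rejects the regex string at run time.
    static boolean hasSyntaxError(String regex) {
        try {
            Pattern.compile(regex);
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    // Returns true if asking for the given numbered capture fails at run time.
    static boolean groupIsOutOfBounds(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("input does not match: " + input);
        }
        try {
            m.group(group);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Unclosed group: javac accepts the string literal, Pattern.compile does not.
        System.out.println(hasSyntaxError("(\\d+-\\d+"));
        // The regex defines two groups, so group 3 does not exist.
        System.out.println(groupIsOutOfBounds("(\\d+)-(\\d+)", "12-34", 3));
    }
}
```

Both calls in main report true: the first regex has an unclosed group, and the second is asked for a capture number it never defined.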

The researchers who did the analysis proposed a solution for Java using annotations, but generalizing this approach is difficult. First, it’s easy to find and analyze string constants, but much harder to analyze strings that have been constructed via string operations, like concatenation and substitution. Second, an approach that works for Java may not port easily to other languages, especially the popular dynamic languages.
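To illustrate why constructed strings are harder to check, here is a small hedged sketch. The helper names and the sample key are hypothetical. The regex only exists once the pieces are concatenated at run time, so no string constant in the source reveals the problem:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class ConstructedRegex {
    // Builds a regex by concatenation; the final pattern exists only at run time.
    static String keyValueRegex(String key) {
        return "^" + key + "=(.*)$";
    }

    // Same construction, but the key is quoted so it is matched literally.
    static String quotedKeyValueRegex(String key) {
        return "^" + Pattern.quote(key) + "=(.*)$";
    }

    // Returns true if the regex string is accepted by Pattern.compile.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String key = "size[";  // hypothetical key, e.g. read from a config file
        System.out.println(compiles(keyValueRegex(key)));        // unquoted key
        System.out.println(compiles(quotedKeyValueRegex(key)));  // quoted key
    }
}
```

Each fragment looks harmless on its own. Only the combination with this particular key produces an unclosed character class, which is why an analysis limited to string constants would miss it.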

Rosie Pattern Language (RPL)[2] provides a way to avoid run-time errors by pre-compiling and automatically testing patterns, thus finding more errors at compile time. Here’s an example of the rosie test command:

[Image: a screen capture of the command 'rosie test rpl/*.rpl'. The output is several lines, each of which is a file name, such as 'date.rpl', followed by the number of pattern tests that passed or failed. One file failed to compile, and the error message about an unbound variable is shown.]

Pre-compiling RPL patterns

As shown in the image above, RPL definitions can (optionally) be pre-compiled to find errors before deployment. Note that mistakes related to numbered captures are not possible in RPL, which names all captures. Syntax errors and other errors, such as missing dependencies, are caught by the RPL compiler. (A linter is on the drawing board to detect subtle errors, and possibly also to evaluate tests for “pattern coverage”.)

The image above shows the rosie test command in action, compiling a directory of RPL files (and running embedded tests — more on that in a minute).

You can, of course, write RPL expressions as string literals directly into large programs or small scripts. But for production use, you can move those expressions into their own file. Rosie can then pre-compile your patterns as part of your build, when other code is compiled, catching errors at compile time, not later at run time.
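The same build-time idea can be sketched for ordinary Java regexes. This is an analogy only, not Rosie's API or implementation: the class name, the pattern names, and the way patterns are supplied as a map are all assumptions. The point is that patterns kept apart from application code can be compiled in one pass during the build:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class PatternBuildCheck {
    // Compiles every named pattern and collects one message per failure.
    static List<String> check(Map<String, String> patterns) {
        List<String> errors = new ArrayList<>();
        for (Map.Entry<String, String> entry : patterns.entrySet()) {
            try {
                Pattern.compile(entry.getValue());
            } catch (PatternSyntaxException ex) {
                errors.add(entry.getKey() + ": " + ex.getDescription());
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        // Hypothetical patterns; in practice these would be loaded from a file.
        Map<String, String> patterns = new LinkedHashMap<>();
        patterns.put("year", "\\d{4}");
        patterns.put("broken", "(\\d{2}");  // deliberately malformed
        List<String> errors = check(patterns);
        errors.forEach(System.err::println);
        System.out.println(errors.size() + " pattern(s) failed to compile");
    }
}
```

A build step or unit test could call check and fail when the returned list is non-empty, so a malformed pattern stops the build and never reaches production.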

In the image above, an error was deliberately injected into the file ts.rpl for illustration. The error is in the statement foo = bar on line 64, which fails to compile because no pattern named bar is defined.

Catching errors early embodies the “shift left” concept popularized by the recent DevOps trend. Rosie also goes beyond pre-compilation by including a lightweight test capability.

Rosie’s built-in lightweight test facility

RPL files can contain comments (which start with a double dash --), and comments are just as valuable for patterns as they are for code. Comments can explain the intention behind a pattern and how it is meant to be used. Specially formatted comments in RPL are read by the rosie test command and executed as pattern tests.

Such comments begin with the word test, followed by the name of the pattern. The next word indicates the type of test. There are currently three types: accepts, rejects, and contains. An accepts test passes when the pattern matches the input, and a rejects test passes when it does not; each is applied in turn to every quoted input string that follows. (The contains test requires an additional argument, the name of a sub-pattern, and checks that the expected sub-pattern appears in a match.)

In the package date (in the file date.rpl), the pattern called us matches the way that dates are commonly written in the United States. The comment below triggers three tests when rosie test is run:

-- test us accepts "April 1, 1900", "Jan 23 2017", "Apr     8"

Tests like this serve two purposes. They contribute to the documentation of the pattern by giving explicit examples of input that matches (is accepted) and input that should not match (is rejected). And because the tests are executable, they function as regression tests, ensuring that the patterns continue to work as intended despite changes that might be made to the RPL code in that file, or to any dependency.
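To make the shape of a test comment concrete, here is a rough Java sketch that splits such a line into its parts. It is not how rosie test is implemented. The class name is hypothetical, only the accepts and rejects forms are handled, and the assumption that quoted inputs contain no escaped quotes is a simplification:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TestComment {
    // Shape assumed here: -- test <pattern> <accepts|rejects> "input", "input", ...
    private static final Pattern HEADER =
        Pattern.compile("^--\\s*test\\s+(\\S+)\\s+(accepts|rejects)\\s+(.*)$");
    // A double-quoted string without escaped quotes (simplifying assumption).
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    final String patternName;
    final String kind;
    final List<String> inputs;

    private TestComment(String patternName, String kind, List<String> inputs) {
        this.patternName = patternName;
        this.kind = kind;
        this.inputs = inputs;
    }

    // Returns the parsed test, or null if the line is not an accepts/rejects test.
    static TestComment parse(String line) {
        Matcher header = HEADER.matcher(line.trim());
        if (!header.matches()) {
            return null;
        }
        List<String> inputs = new ArrayList<>();
        Matcher quoted = QUOTED.matcher(header.group(3));
        while (quoted.find()) {
            inputs.add(quoted.group(1));
        }
        return new TestComment(header.group(1), header.group(2), inputs);
    }

    public static void main(String[] args) {
        TestComment t = parse(
            "-- test us accepts \"April 1, 1900\", \"Jan 23 2017\", \"Apr     8\"");
        System.out.println(t.patternName + " " + t.kind + " " + t.inputs.size());
    }
}
```

Each quoted string becomes one test case, which is why the single comment for the pattern us triggers three tests. The comma inside "April 1, 1900" stays part of the input because the inputs are found by their quotes and not by splitting on commas.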

The figure below shows the part of the file date.rpl where the US and European date patterns, including date.eur, are defined. The comments following each pattern are executed by the rosie test command.

[Image: a portion of a file of RPL code, with pattern definitions for two date formats, US and European. Near each definition is a set of tests, written as comments, which include example input text. There are two kinds of tests: Pattern X accepts input Y, and Pattern X rejects input Y.]

Rosie was created for scalable pattern matching

Scalability goals for Rosie include big data, large (complex) patterns, and many developers. Pre-compilation and built-in executable tests are steps towards those goals. They are steps towards bringing the tools and techniques we use in general programming to the patterns we write to process raw data.

Pre-compilation and built-in testing are key parts of #modernpatternmatching.

[1] Spishak, Dietl, and Ernst, “A type system for regular expressions”. In Proceedings of the 14th Workshop on Formal Techniques for Java-like Programs, ECOOP 2012

[2] The rosie test command is implemented in a working prototype in branch tranche-3 of the Rosie Pattern Language repository on GitHub. This prototype is evolving in steps toward release v1.0.0.

Follow us on Twitter for announcements. We expect v1.0.0 to be released late this summer.