Posted on Aug 15, 2017
Regular expressions are hard to debug when they fail to match what you think they should match (and vice versa). That’s why there are so many websites offering regex debugging tools. Rosie expressions can likewise be hard to debug at times, and I think for the same reason: Pattern matchers (parsers, generally) are algorithms with a very large number of states, essentially all of which influence the next step to be taken. There are many ways that a human being’s mental model of the algorithm’s state can be wrong.
Plus, debugging any program (not just pattern matchers) can be hard!
Rosie provides a number of features that aid debugging, including:
- Symbols: Patterns have names, and names enable symbolic debugging
- Composition: Patterns are composed from other patterns, enabling divide and conquer, where you separately debug the pieces that make up your pattern
- Built-in tests: You can optionally declare tests in your comments, and Rosie will execute them, helping to find errors early (and to understand the author’s intention of what was supposed to match in the first place)
- Read-eval-print loop: Rosie’s repl lets you build up patterns interactively, and test against real data
- Trace command: It’s easy to peek inside the Rosie pattern engine to see exactly how your pattern is being applied to your input
The read-eval-print loop of Rosie version 1.0 will be the subject of a future blog post. (You can read about the existing repl in, e.g. this blog post).
In this post, we will take a very quick look at the `trace` command, which works just like Rosie’s `match` command, except it outputs a trace of the matching process in the form of a tree (reflecting how patterns are composed from other patterns).
Here is an example in which Rosie’s `date.any` pattern fails to match “12 Agosto 2017”, and how we might discover that “Agosto” is not one of the month names defined in the `date` package. (The rest of this post will explore this transcript in detail.)
Using the `match` command
The Rosie CLI can be used the way grep is used, to quickly find information in files. And like most unix tools, Rosie can read from standard input (which requires a single dash in place of the filename argument).
So we can pipe the output of the unix `date` command into Rosie as shown below.
On OS X and Linux, `date -R` will print the date in the RFC 5322 format, long known as the RFC 2822 Internet Message Format.
Naturally, we can also `echo` sample input and pipe that into Rosie:
Note: We are only parsing the date in these examples, not the entire timestamp, in order to keep the examples short.
The first 3 commands above succeeded, which we can see because the date was printed by Rosie (and printed in blue, the default color for dates and times in Rosie). If we had added `-o json` to the commands, we would have seen that “Sat Aug 12” matched `date.us`; “Sat, 12 Aug 2017” matched `date.rfc2822`; and “12 August 2017” also matched `date.rfc2822`.
The last command in the transcript above failed (there was no output). The input was “12 Agosto 2017”. Let’s see how exactly it failed by using the `trace` command.
Using the `trace` command
Simply replace `match` with `trace` in the rosie invocation. The output, given our sample input, will look like the screen capture below. Here, I’ve cut out the nested part (replaced with “…”) so that we can see the top level of the tree.
The root of the tree, at the top in the thin green box, is the expression that is the definition of `date.any`. You can see that `date.any` is defined as an ordered choice between 6 different date formats.
The first level of nodes under the root shows the evaluation of each date format in turn, starting with `date.us` and ending, at the bottom, with `date.rfc3339`. Each alternative concludes with “No match”, which we expect, because we know the pattern `date.any` failed to match this input.
One alternative, `date.rfc2822`, has a subtree shown. The root of that subtree is the definition of `date.rfc2822`, which begins with the expression `{day_name ~ ","}?`. The question mark at the end indicates, as in regex, that the expression must match 0 or 1 time. And the expression is a sequence of 3 other expressions: `day_name`, `~` (the Rosie “word boundary” expression), and `","` (a literal comma).
We will explain Rosie’s ability to automatically insert the boundary pattern in a future blog post. (If you’re curious, see this documentation for Rosie v0.99k.)
Back to our trace output… Let’s look at the details of how the pattern `date.rfc2822` was processed against the input “12 Agosto 2017”. Here is the full transcript:
The first sub-expression matches 0 characters, because although there is no `day_name` in the input, the first sub-expression is optional. The next sub-expression is the word boundary, `~`, which in this case matches 0 characters because we are at the start of the input (character position 1). The next sub-expression, `day`, matches 2 characters (“12”) and the next one, a boundary, matches the space after “12”.
Next is the sub-expression `month_name`, and the engine is now at position 4 of the input, looking at “Agosto 2017”. As we see, there is “No match” (next to the red arrow in the screen capture).
The remaining two parts of the sequence, `~` and `year`, are not attempted, because the sequence has already failed. Popping back to the first level of child nodes, Rosie goes on to try to match `date.rfc3339`, which also fails, and so `date.any` (the root expression) fails.
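To make the walk-through above concrete, here is a minimal Python sketch of a traced sequence match. It is not how Rosie is implemented: the step list, the regular expressions standing in for `day_name`, `day`, `month_name`, and `year`, and the treatment of `~` as "zero or more whitespace characters" are all simplifying assumptions of mine, and `trace_sequence` is a hypothetical helper name.

```python
import re

# Hypothetical stand-ins for the sub-expressions of date.rfc2822.
# The real definitions live in Rosie's date package; these regexes
# are rough approximations for illustration only.
DAY_NAME = r"(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)"
MONTH_NAME = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
    r"|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
)
BOUNDARY = r"\s*"  # crude approximation of Rosie's ~ (an assumption)

RFC2822_STEPS = [
    ('{day_name ~ ","}?', rf"(?:{DAY_NAME}\s*,)?"),
    ("~", BOUNDARY),
    ("day", r"\d{1,2}"),
    ("~", BOUNDARY),
    ("month_name", MONTH_NAME),
    ("~", BOUNDARY),
    ("year", r"\d{4}"),
]


def trace_sequence(steps, text):
    """Try each step in order, recording what happened at each one.

    Returns (matched, end_position, trace_lines). Positions in the
    trace lines are 1-based, as in the post; end_position is 0-based.
    """
    pos = 0
    lines = []
    for name, regex in steps:
        m = re.compile(regex).match(text, pos)
        if m is None:
            # A sequence fails as soon as one step fails; later steps
            # are never attempted.
            lines.append(f"{name}: No match at position {pos + 1}")
            return False, pos, lines
        lines.append(f"{name}: matched {m.end() - pos} chars at position {pos + 1}")
        pos = m.end()
    return True, pos, lines


matched, end, trace = trace_sequence(RFC2822_STEPS, "12 Agosto 2017")
print("\n".join(trace))
```

Run against “12 Agosto 2017”, the sketch records five steps and stops at `month_name` with “No match at position 4”; the final `~` and `year` steps never run, which mirrors the behavior described above.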
Even more detail is available
Rosie’s `trace` command is smart enough to choose how much detail to show.[1] The output contains the most relevant matching steps taken along the path in the tree that was the most productive; that is, the path that consumed the most input. This is a good heuristic (though not perfect) when a parser has to guess which of the various alternatives was the one that the user wanted to succeed.
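The “most productive path” idea can be sketched in a few lines of Python. This only illustrates the heuristic as described here, not Rosie’s actual code: the three alternatives, their regex steps, and the function names `progress` and `most_productive` are hypothetical.

```python
import re

# Hypothetical, heavily simplified alternatives (not Rosie's definitions).
MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"
ALTERNATIVES = {
    "us_like": [MONTH, r"\s+", r"\d{1,2}"],
    "rfc2822_like": [r"\d{1,2}", r"\s+", MONTH, r"\s+", r"\d{4}"],
    "iso_like": [r"\d{4}", r"-", r"\d{2}", r"-", r"\d{2}"],
}


def progress(steps, text):
    """Return (matched, chars_consumed) for a sequence of regex steps."""
    pos = 0
    for regex in steps:
        m = re.compile(regex).match(text, pos)
        if m is None:
            return False, pos  # report how far we got before failing
        pos = m.end()
    return True, pos


def most_productive(alternatives, text):
    """Pick the alternative that consumed the most input.

    Ties go to the alternative listed first.
    """
    results = {name: progress(steps, text) for name, steps in alternatives.items()}
    best = max(results, key=lambda name: results[name][1])
    return best, results


best, results = most_productive(ALTERNATIVES, "12 Agosto 2017")
print(best, results[best])
```

For “12 Agosto 2017”, the RFC 2822-like alternative consumes three characters before failing, while the other two fail immediately, so it is the branch worth showing in detail.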
If you want to see a complete trace, in which every possible branch is explained in detail, try the same command with the `--verbose` flag added. The tree output will be longer, but hopefully still readable once you get a feel for what you are looking at.
Rosie was created for scalable pattern matching
Scalability goals for Rosie include big data, large (complex) patterns, and many developers. The ability to trace a match, at varying levels of detail, is a key debugging feature. As with a stack trace, the output contains a lot of information; and as with a stack trace, you quickly get used to reading it.
Tracing, without having to paste your patterns and data into some random website, is how #modernpatternmatching is done.
[1] The output shown in this post is from a working version of Rosie in branch `tranche-3` of the Rosie Pattern Language repository on GitLab. This prototype is evolving by steps to become release v1.0.0.
Follow us on Twitter for announcements. We expect v1.0.0 to be released late this summer.