Posted on Aug 30, 2017
RPL has many concepts in common with regex, and the syntax of RPL reflects this. So if you know regex, you know a lot of RPL already!
Rosie Pattern Language (RPL) is more powerful than regular expressions (regex). RPL can express every pattern that regex can, as well as recursive patterns (like html, xml, json, and s-expressions) that regex cannot. But RPL has so much in common with regex that it’s easy to get started. Let’s look at the concepts that are the same, and the ones with small (but important) differences.
Familiar concepts
The following RPL concepts are nearly identical, and have nearly identical syntax, to their regex counterparts.
Repetition
RPL expression | Matches |
---|---|
pat* | Zero or more copies of pat |
pat+ | One or more copies of pat |
pat? | Zero or one copies of pat |
pat{n} | Exactly n copies of pat |
pat{n,m} | At least n and at most m copies of pat (n defaults to 0 and m to unlimited) |
Repetition in RPL is greedy and possessive. Regex repetition is greedy by default as well, but it is not possessive. In other words, repeated patterns in RPL will always match as many copies as possible (greedy), and will not backtrack to try fewer copies (possessive).
In practice, this speeds up the matching (less backtracking), but more importantly, it simplifies the mental model of how matching works. Keeping in mind all the ways that regex can backtrack requires much practice, and is a frequent source of errors when writing regex patterns. RPL backtracks only where there are explicit choices, so understanding an RPL expression is more like understanding if-then-else logic. (See the section on choices, below.)
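To make the difference concrete, here is a minimal Python sketch. It does not use Rosie itself: `possessive_star` and `star_then_one` are hypothetical helpers that model possessive repetition by hand, and Python's `re` module stands in for a backtracking regex engine.

```python
import re

def possessive_star(ch, text, pos):
    """Consume every consecutive copy of ch starting at pos; never give any back."""
    while pos < len(text) and text[pos] == ch:
        pos += 1
    return pos

def star_then_one(ch, text):
    """Model the sequence `ch* ch` with possessive repetition.

    Returns the end position of the match, or None on failure.
    """
    pos = possessive_star(ch, text, 0)
    if pos < len(text) and text[pos] == ch:
        return pos + 1
    return None

# A backtracking regex engine gives one "a" back so the final "a" can match.
regex_matches = re.fullmatch(r"a*a", "aaa") is not None

# The possessive model has already consumed all three, so the final "a" fails.
possessive_end = star_then_one("a", "aaa")
```

Here `regex_matches` is `True` while `possessive_end` is `None`: the regex succeeds only because it backtracks, which is exactly the step a possessive repetition never takes.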
Character sets
RPL expression | Meaning |
---|---|
[:name:] | Named character set (see note [a]) |
[:^name:] | Complement of a named character set |
[x-y] | Range of characters, from x to y (see note [b]) |
[^x-y] | Complement of a character range |
[...] | List of characters (in place of ...) |
[^...] | Complement of the character list ... |
[cs1 cs2 ...] | Union of character sets cs1, cs2, etc. (E.g. [[a-f][0-9]]) |
[^ cs1 cs2 ...] | Complement of a union of character sets |
Note [a]. In Rosie v0.99k, the named character classes are the familiar POSIX ones: alpha, digit, space, etc. Coming in Rosie v1 will be character classes built from Unicode properties including category, script, and block.
Note [b]. In Rosie v1, the meaning of [x-y] will be “all the characters whose code points (in Unicode) are between the code points of x and y, inclusive.”
In addition to the ability to specify character sets based on Unicode properties, additional character set operations are forthcoming. Notably, the ability to specify the difference and the intersection of arbitrary character sets will be supported.
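The set operations in the table, and the difference and intersection mentioned above, can be modeled with ordinary Python sets of characters. This is only a sketch of the semantics, not of how Rosie implements character sets. The helper `char_range` and the 7-bit `ASCII` universe used for complements are assumptions made for illustration.

```python
def char_range(lo, hi):
    """All characters whose code points lie between lo and hi, inclusive."""
    return {chr(cp) for cp in range(ord(lo), ord(hi) + 1)}

# Assumed universe for this sketch: 7-bit ASCII only.
ASCII = {chr(cp) for cp in range(128)}

hex_lower = char_range("a", "f") | char_range("0", "9")  # union, like [[a-f][0-9]]
not_hex_lower = ASCII - hex_lower                         # complement, like [^ [a-f][0-9]]
letters_only = hex_lower - char_range("0", "9")           # difference
shared = char_range("a", "m") & char_range("k", "z")      # intersection
```

Defining a range by code points, as in note [b], is what makes `char_range` a two-line function: it only needs `ord` and `chr`.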
Look around
RPL expression | Meaning |
---|---|
> pat | Look ahead at pat (predicate: consumes no input) |
< pat | Look behind at pat (predicate: consumes no input) |
!pat | Not pat, i.e. not looking at pat. Same as !>pat. |
Here, we have greatly simplified the regex syntax for the “look around”
expressions lookahead and lookbehind. The parentheses and the leading ?
that are needed in regex are not needed here. And, we have made the symbols for
lookahead and lookbehind symmetric, using left-pointing and right-pointing angle
characters.
There is no special expression for negative lookbehind, because RPL expressions compose. We can simply write !<pat, which means “not looking behind at pat”. For what it’s worth, <!pat is equivalent, though technically it means “looking behind at something which is not pat”.
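For comparison, here are the four regex forms that the RPL operators replace, written with Python's `re` module. The sample text and the sub-patterns `foo` and `bar` are made up for illustration. The comments name the RPL operator each regex form corresponds to.

```python
import re

text = "foobar"

look_ahead = re.search(r"foo(?=bar)", text)    # regex counterpart of >pat
look_behind = re.search(r"(?<=foo)bar", text)  # regex counterpart of <pat
neg_ahead = re.search(r"foo(?!bar)", text)     # regex counterpart of !pat
neg_behind = re.search(r"(?<!foo)bar", text)   # regex counterpart of !<pat
```

The two positive forms match (`foo` and `bar` respectively) without consuming the text they peek at, and the two negative forms find nothing in `foobar`. Each regex form needs parentheses, a `?`, and one or two more symbols, where RPL uses a single prefix character or a composition of two.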
Sequences
RPL expression | Meaning |
---|---|
p q | Sequence: match p and next match q |
In RPL, whitespace is ignored, so we can literally write p q (with a space in between).
Variations on regex concepts
Literals
RPL expression | Meaning |
---|---|
"abcdef" | (String literal) Matches the string abcdef. |
RPL uses identifiers to denote patterns. You can define a pattern named p and one named q and then write expressions like the sequence p q or the choice p/q. To match a string literally in RPL, you quote it to distinguish it from identifiers.
The pattern "Hello, world" will match only the input Hello, world, with exactly one space after the comma.
Literals can contain familiar escape characters like \n (newline) and also Unicode codepoint syntax like \u263b (a smiling face: ☻) or \U0001F63C (cat face with wry smile: 😼).
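Python string literals happen to use the same \u and \U escape forms, so a short Python check shows what the two escapes above denote. This illustrates the escape notation only, not Rosie's literal parser. The variable names are made up for the example.

```python
import unicodedata

smiley = "\u263b"       # \u takes exactly 4 hex digits
wry_cat = "\U0001F63C"  # \U takes exactly 8 hex digits

# Each escape denotes a single code point, however many digits it takes to write.
smiley_name = unicodedata.name(smiley)
wry_cat_name = unicodedata.name(wry_cat)
```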
Choices (alternatives)
RPL expression | Meaning |
---|---|
p / q | Ordered choice: match p, and if p fails, match q |
RPL supports ordered choice, not the “anything goes” alternation of regex. To emphasize the difference, we denote choice in RPL with a slash (/) instead of the regex pipe (|).
As the name suggests, ordered choices are attempted in the order written. And choices in RPL are possessive, meaning that once an alternative succeeds, none of the remaining alternatives will be attempted.
The notion of ordered choice, as well as the slash syntax, comes from Parsing Expression Grammars (PEGs), on which Rosie is based.
The way to think about ordered choice is that backtracking occurs only within the choice expression. For example, consider the expression:
(p/q/r) s
It is a sequence of two other RPL expressions, the first of which is the choice (p/q/r) (parenthesized for grouping, like we would write the arithmetic expression (k+1)*j in a program).
The RPL matching engine will first try to match the pattern p. If that fails, it will reset (backtrack) to the position it was in when it started to match p, and it will try q. Failing that, it resets again and tries r. If all of the alternatives fail, the pattern fails. If one of them succeeds, the engine continues by trying to match s. Here’s the key: If s fails, the entire expression fails. RPL does not backtrack to the choice and try a different alternative. Once a choice succeeds, the choice is locked in.
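The walk-through above can be modeled in a few lines of Python. This is a sketch of PEG-style ordered choice, not Rosie's actual engine. The combinators `lit`, `choice`, and `seq` are hypothetical helpers, and the concrete literals bound to p, q, r, and s are chosen only to make the locked-in behavior visible.

```python
import re

def lit(s):
    """Pattern for the literal s: returns the new position, or None on failure."""
    def match(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return match

def choice(*alts):
    """Ordered choice: the first alternative that succeeds is locked in."""
    def match(text, pos):
        for alt in alts:
            end = alt(text, pos)  # every alternative restarts at the same pos
            if end is not None:
                return end
        return None
    return match

def seq(*parts):
    """Sequence: each part must match where the previous one ended."""
    def match(text, pos):
        for part in parts:
            pos = part(text, pos)
            if pos is None:
                return None
        return pos
    return match

p, q, r, s = lit("a"), lit("ab"), lit("x"), lit("c")

# (p/q/r) s on "abc": p matches "a", then s fails on "b", so the whole match fails.
locked_in_result = seq(choice(p, q, r), s)("abc", 0)

# The regex alternation backtracks into the group and retries with "ab".
regex_result = re.fullmatch(r"(?:a|ab|x)c", "abc")
```

Here `locked_in_result` is `None` even though q followed by s would have matched the input, while `regex_result` is a successful match. Note that `seq` has no way to ask `choice` for a second answer, which is the "locked in" property in code form.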
Of course, we can obtain the behavior of the regex alternation if we need it. We just have to write explicitly what we want to happen. The RPL equivalent to the regex (?:p|q|r)s would be:
{p s} / {q s} / {r s}
Two observations:
- The regex needs ?: in order to use parentheses for grouping. Otherwise, they both group and define a capture. In RPL, captures are automatic, and not conflated with grouping. More on this in future posts.
- Both parentheses () and curly braces {} are used for grouping in RPL. They differ when surrounding a sequence, in which case () effectively tokenizes the input. The pattern ("nameserver" net.any) will match the word nameserver, followed by a token boundary (such as whitespace), followed by a network address. I.e. it will match nameserver 1.2.3.4.
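As a rough check on the rewritten choice {p s} / {q s} / {r s}, here is a compact Python model of ordered choice and sequence. The helpers `lit`, `seq`, and `choice` and the literals bound to p, q, r, and s are assumptions for illustration. None of this is Rosie's API.

```python
def lit(s):
    """Literal pattern: returns the position after s, or None on failure."""
    return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None

def seq(first, second):
    """Match first, then second from where first ended."""
    def match(text, pos):
        mid = first(text, pos)
        return None if mid is None else second(text, mid)
    return match

def choice(*alts):
    """Ordered choice: return the result of the first alternative that succeeds."""
    def match(text, pos):
        for alt in alts:
            end = alt(text, pos)
            if end is not None:
                return end
        return None
    return match

p, q, r, s = lit("a"), lit("ab"), lit("x"), lit("c")

# {p s} / {q s} / {r s}: s is now inside each alternative, so a failure of s
# after p simply moves on to the next alternative.
distributed = choice(seq(p, s), seq(q, s), seq(r, s))
end_position = distributed("abc", 0)
```

With these literals, `end_position` is 3: the first alternative fails when s meets "b", and the second alternative consumes the whole input. Moving s inside each alternative is what turns the failure of s into an ordinary choice point.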
Enhancements over regex
RPL has many enhancements over regex, and we will post about those very soon. They include:
- Captures are automatically named by their pattern identifier (not numbered!)
- Rosie can fully parse its input, giving a parse tree as output
- RPL definitions can be put into a package, making it easy to share
- Rosie is built as a portable C library, with interfaces defined in several languages (including Python, Go, Ruby, and Node.js)
- And a secret new feature: An extensible macro facility will make it possible to write complicated patterns in a concise way, saving typing but also making patterns easier to read. Anonymous sources high up in the Rosie project are unable to provide more information at this time. 🤐 Watch this site for more in the coming weeks!
Move from regex to Rosie for scalable pattern matching
Building on so many familiar regex concepts, the learning curve is not too steep. And Rosie is a technology designed to save development time:
- Avoid writing patterns at all, by reusing existing ones from available packages
- No “driver program” needed (unless you want one) because the CLI can output JSON, which can be piped into another application or saved
- Trying out ideas (and debugging in general) is easy with the built-in REPL and trace capability
This is how #modernpatternmatching is done.
Follow us on Twitter for announcements. We expect v1.0.0 to be released around the celestial end of summer, i.e. the autumnal equinox.