RPL has many concepts in common with regex, and the syntax of RPL reflects this. So if you know regex, you know a lot of RPL already!

Rosie Pattern Language (RPL) is more powerful than regular expressions (regex). RPL can express every pattern that regex can, as well as recursive patterns (like html, xml, json, and s-expressions) that regex cannot. But RPL has so much in common with regex that it’s easy to get started. Let’s look at the concepts that are the same, and the ones with small (but important) differences.

Familiar concepts

The following RPL concepts are nearly identical, and have nearly identical syntax, to their regex counterparts.

Repetition

RPL expression   Matches
pat*             Zero or more copies of pat
pat+             One or more copies of pat
pat?             Zero or one copies of pat
pat{n}           Exactly n copies of pat
pat{n,m}         At least n and at most m copies of pat (n defaults to 0 and m to unlimited)

Repetition in RPL is both greedy and possessive. Most regex engines are greedy by default, but not possessive. In other words, repeated patterns in RPL will always match as many copies as possible (greedy), and will not backtrack to try fewer copies (possessive).

In practice, this speeds up the matching (less backtracking), but more importantly, it simplifies the mental model of how matching works. Keeping in mind all the ways that regex can backtrack requires much practice, and is a frequent source of errors when writing regex patterns. RPL backtracks only where there are explicit choices, so understanding an RPL expression is more like understanding if-then-else logic. (See the section on choices, below.)
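To make the difference concrete, here is a minimal sketch in Python. It is not how Rosie is implemented. The helper names lit, star, and seq are invented for this illustration, and they model only the "take everything, give nothing back" behavior described above. Python's re module stands in for a typical backtracking regex engine.

```python
import re

def lit(string):
    """Match a literal string at pos; return the new position, or None on failure."""
    return lambda text, pos: pos + len(string) if text.startswith(string, pos) else None

def star(pattern):
    """Greedy, possessive 'zero or more': consume every copy, never give one back."""
    def match(text, pos):
        while True:
            nxt = pattern(text, pos)
            if nxt is None or nxt == pos:  # no further copy (or no progress): stop
                return pos
            pos = nxt
    return match

def seq(*parts):
    """Match each part in order; fail as soon as one part fails."""
    def match(text, pos):
        for part in parts:
            pos = part(text, pos)
            if pos is None:
                return None
        return pos
    return match

# Backtracking regex: a* first takes all three letters, then gives one back
# so that the final "a" can match.
regex_result = re.fullmatch(r"a*a", "aaa") is not None

# Possessive repetition: the star takes all three letters and keeps them,
# so the final "a" has nothing left to match and the sequence fails.
possessive_result = seq(star(lit("a")), lit("a"))("aaa", 0)

# When the next pattern does not compete with the repetition, both agree.
no_conflict = seq(star(lit("a")), lit("b"))("aaab", 0)

print(regex_result, possessive_result, no_conflict)
```

The second case is the one to remember: a pattern that relies on the repetition "giving back" input has to be written differently in RPL.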

Character sets

RPL expression     Meaning
[:name:]           Named character set (see note [a])
[:^name:]          Complement of a named character set
[x-y]              Range of characters, from x to y (see note [b])
[^x-y]             Complement of a character range
[...]              List of characters (in place of ...)
[^...]             Complement of the character list ...
[cs1 cs2 ...]      Union of character sets cs1, cs2, etc. (E.g. [[a-f][0-9]])
[^ cs1 cs2 ...]    Complement of a union of character sets

Note [a]. In Rosie v0.99k, the named character classes are the familiar POSIX ones: alpha, digit, space, etc. Rosie v1 will add character classes built from Unicode properties, including category, script, and block.

Note [b]. In Rosie v1, the meaning of [x-y] will be “all the characters whose code points (in Unicode) are between the code points of x and y, inclusive.”

In addition to the ability to specify character sets based on Unicode properties, additional character set operations are forthcoming. Notably, the ability to specify the difference and the intersection of arbitrary character sets will be supported.
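The set operations above can be pictured with a small Python sketch that models a character set as a predicate on a single character. The function names are invented for this illustration and are not Rosie APIs. The range check follows the code point definition in note [b]. The intersection and difference helpers show only the set-theoretic idea, not any RPL syntax, since that syntax has not been announced here.

```python
def char_range(lo, hi):
    """[lo-hi]: every character whose code point lies between lo and hi, inclusive."""
    return lambda ch: ord(lo) <= ord(ch) <= ord(hi)

def union(*sets):
    """[cs1 cs2 ...]: a character belongs if any member set accepts it."""
    return lambda ch: any(s(ch) for s in sets)

def complement(cs):
    """[^ ...]: a character belongs if the inner set rejects it."""
    return lambda ch: not cs(ch)

def intersection(a, b):
    """Characters accepted by both sets (set-theoretic idea only)."""
    return lambda ch: a(ch) and b(ch)

def difference(a, b):
    """Characters accepted by a but not by b (set-theoretic idea only)."""
    return lambda ch: a(ch) and not b(ch)

# Models the example from the table: [[a-f][0-9]]
hex_lower = union(char_range("a", "f"), char_range("0", "9"))
not_hex = complement(hex_lower)

# Lowercase letters that are not hex digits: g through z
beyond_f = difference(char_range("a", "z"), char_range("a", "f"))

# A range defined by code points works for non-ASCII characters as well
greek_lower = char_range("α", "ω")

print(hex_lower("c"), hex_lower("g"), not_hex("g"), beyond_f("g"), greek_lower("λ"))
```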

Look around

RPL expression   Meaning
> pat            Look ahead at pat (predicate: consumes no input)
< pat            Look behind at pat (predicate: consumes no input)
!pat             Not pat, i.e. not looking at pat. Same as !>pat.

Here, we have greatly simplified the regex syntax for the “look around” expressions lookahead and lookbehind. The parentheses and the leading ? that are needed in regex are not needed here. We have also made the symbols for lookahead and lookbehind symmetric: a right-pointing angle character (>) for lookahead and a left-pointing one (<) for lookbehind.

There is no special expression for negative lookbehind, because RPL expressions compose. We can simply write !<pat, which means “not looking behind at pat”. For what it’s worth, <!pat is equivalent, though technically it means “looking behind at something which is not pat”.
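For comparison, here is the regex side of the story, using Python's re module as a representative regex engine. The RPL spellings in the comments are assembled from the table and the paragraph above, with string literals quoted as RPL requires; treat them as illustrative rather than as tested Rosie input.

```python
import re

# regex (?=foo)   ~   RPL  > "foo"
# The predicate succeeds at position 0 and consumes nothing.
ahead = re.match(r"(?=foo)", "foobar")
ahead_end = ahead.end()

# regex (?!foo)   ~   RPL  !"foo"
# The input does start with foo, so the negative predicate fails.
not_ahead = re.match(r"(?!foo)", "foobar")

# regex (?<=\$)\d+   ~   RPL  < "$"  followed by a pattern for digits
behind = re.search(r"(?<=\$)\d+", "cost: $42")

# regex (?<!\$)\b\d+   ~   RPL  !< "$"  followed by a pattern for digits
# The \b keeps the search from starting in the middle of 42.
not_behind = re.search(r"(?<!\$)\b\d+", "cost: $42, qty 7")

print(ahead_end, not_ahead, behind.group(), not_behind.group())
```

Note how each regex form needs its own bracketed spelling, while the RPL forms are built from three composable symbols.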

Sequences

RPL expression   Meaning
p q              Sequence: match p and next match q

In RPL, whitespace is ignored, so we can literally write p q (with a space in between).

Variations on regex concepts

Literals

RPL expression   Meaning
"abcdef"         (String literal) Matches the string abcdef.

RPL uses identifiers to denote patterns. You can define a pattern named p and one named q and then write expressions like the sequence p q or the choice p/q. To match a string literally in RPL, you quote it to distinguish it from identifiers.

The pattern "Hello, world" will match only the input Hello, world, with exactly one space after the comma.

Literals can contain familiar escape characters like \n (newline) and also Unicode codepoint syntax like \u263b (a smiling face: ☻) or \U0001F63C (cat face with wry smile: 😼).
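Python string literals happen to accept the same two escape forms, so a quick way to check which character an escape denotes is to evaluate it in Python. This sketch says nothing about how Rosie processes escapes internally; it only confirms the code points named above.

```python
smiley = "\u263b"      # four hex digits after a lowercase \u
cat = "\U0001F63C"     # eight hex digits after an uppercase \U

# Each escape denotes exactly one character (one Unicode code point).
code_points = (hex(ord(smiley)), hex(ord(cat)))

# Encoded as UTF-8, the two characters occupy 3 and 4 bytes respectively.
utf8_lengths = (len(smiley.encode("utf-8")), len(cat.encode("utf-8")))

print(smiley, cat, code_points, utf8_lengths)
```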

Choices (alternatives)

RPL expression   Meaning
p / q            Ordered choice: match p, and if p fails, match q

RPL supports ordered choice, not the “anything goes” alternation of regex. To emphasize the difference, we denote choice in RPL with a slash (/) instead of the regex pipe (|).

As the name suggests, ordered choices are attempted in the order written. Choices in RPL are also possessive, meaning that once an alternative succeeds, none of the remaining alternatives will be attempted.

The notion of ordered choice, as well as the slash syntax, comes from Parsing Expression Grammars (PEGs), on which Rosie is based.

The way to think about ordered choice is that backtracking occurs only within the choice expression. For example, consider the expression:

(p/q/r) s

It is a sequence of two other RPL expressions, the first of which is the choice (p/q/r) (parenthesized for grouping, like we would write the arithmetic expression (k+1)*j in a program).

The RPL matching engine will first try to match the pattern p. If that fails, it will reset (backtrack) to the position it was in when it started to match p, and it will try q. Failing that, it resets again and tries r. If all of the alternatives fail, the pattern fails. If one of them succeeds, the engine continues by trying to match s. Here’s the key: If s fails, the entire expression fails. RPL does not backtrack to the choice and try a different alternative. Once a choice succeeds, the choice is locked in.

Of course, we can obtain the behavior of the regex alternation if we need it. We just have to write explicitly what we want to happen. The RPL equivalent to the regex (?:p|q|r)s would be:

{p s} / {q s} / {r s}
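The following Python sketch walks through both forms on a concrete input. It is a toy model of ordered choice, not Rosie's engine, and the helpers lit, seq, and choice are invented for this illustration. The concrete patterns are also assumptions chosen to expose the difference: p matches a, q matches ab, r matches x, and s matches c.

```python
import re

def lit(string):
    """Match a literal string at pos; return the new position, or None on failure."""
    return lambda text, pos: pos + len(string) if text.startswith(string, pos) else None

def seq(*parts):
    """Match each part in order; fail as soon as one part fails."""
    def match(text, pos):
        for part in parts:
            pos = part(text, pos)
            if pos is None:
                return None
        return pos
    return match

def choice(*alternatives):
    """Ordered choice: return the first success and never revisit the decision."""
    def match(text, pos):
        for alt in alternatives:
            end = alt(text, pos)  # every alternative starts from the same position
            if end is not None:
                return end
        return None
    return match

p, q, r, s = lit("a"), lit("ab"), lit("x"), lit("c")

# (p/q/r) s : p matches "a", the choice is locked in, then s fails at "b".
locked_in = seq(choice(p, q, r), s)("abc", 0)

# {p s} / {q s} / {r s} : the first alternative fails as a whole,
# so the second one (q s) gets its turn and matches all of "abc".
spelled_out = choice(seq(p, s), seq(q, s), seq(r, s))("abc", 0)

# Regex alternation backtracks into the group by itself.
regex_alt = re.fullmatch(r"(?:a|ab|x)c", "abc")

print(locked_in, spelled_out, regex_alt is not None)
```

The first form fails on abc while the other two succeed, which is exactly the difference between a locked-in choice and regex alternation.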

Two observations:

  • The regex needs ?: in order to use parentheses for grouping. Otherwise, they both group and define a capture. In RPL, captures are automatic, and not conflated with grouping. More on this in future posts.

  • Both parentheses () and curly braces {} are used for grouping in RPL. They differ when surrounding a sequence, in which case () effectively tokenizes the input. The pattern ("nameserver" net.any) will match the word nameserver, followed by a token boundary (such as whitespace), followed by a network address. I.e. it will match nameserver 1.2.3.4.

Enhancements over regex

RPL has many enhancements over regex, and we will post about those very soon. They include:

  • Captures are automatically named by their pattern identifier (not numbered!)
  • Rosie can fully parse its input, giving a parse tree as output
  • RPL definitions can be put into a package, making it easy to share
  • Rosie is built as a portable C library, with interfaces defined in several languages (including Python, Go, Ruby, and Node.js)
  • And a secret new feature: An extensible macro facility will make it possible to write complicated patterns in a concise way, saving typing but also making patterns easier to read. Anonymous sources high up in the Rosie project are unable to provide more information at this time. 🤐 Watch this site for more in the coming weeks!

Move from regex to Rosie for scalable pattern matching

Because RPL builds on so many familiar regex concepts, the learning curve is not too steep. And Rosie is a technology designed to save development time:

  1. Avoid writing patterns at all, by reusing existing ones from available packages
  2. No “driver program” needed (unless you want one) because the CLI can output JSON, which can be piped into another application or saved
  3. Trying out ideas (and debugging in general) is easy with the built-in REPL and trace capability

This is how #modernpatternmatching is done.


Follow us on Twitter for announcements. We expect v1.0.0 to be released around the celestial end of summer, i.e. the autumnal equinox.