Rosie's purpose (and Rosie 1.3)
With the recent release of version 1.2.2, it seems like a good moment to talk about the future. We are making plans for a version 1.3 and, eventually, a “Rosie 2.0”. In this post, we’ll first look at the use cases for Rosie/RPL and then at what’s coming in Rosie v1.3. We’ll save the topic of “Rosie 2.0” for a future post.
Use cases for Rosie/RPL
Background: IBM, circa 2015
The Rosie Project, and the RPL language, were initially developed at IBM. After roles in product architecture, technical strategy, and technical standards, I found myself on what we called an advanced technology team. That was a fancy name for a fun job: looking at tech trends and building prototypes of whatever we thought might be important to IBM’s software business over a 1-3 year time frame.
Our research division often focused on longer time horizons, and of course product development teams had very short term goals. An advanced tech team was a way to bridge the gap. For instance, when Docker was new, I figured out how to package some of our WebSphere products in Docker containers. By doing this well before any customer asked for it, we could consult with other IBM business units about how their products would or wouldn’t work in a container, as well as advise Tivoli (IBM’s system management brand) on how to manage container deployments.
Back to Rosie! Around 2015 or so, my advanced technology team was building a prototype product for IBM cloud, which would (ideally) give actionable feedback to development teams about their process. Our aim was to use machine learning to generate suggestions about what code should be reviewed, where team agility or communication was suffering, and even where future bugs were likely to arise.
To generate those suggestions and predictions, our prototype hoovered up data from source code commits, reported issues, test results, and log files of running systems – in other words, data from almost all phases of development.
Our prototype had Scala, Python, and Lua code in it, with an occasional shell or Perl script, for good measure. We were taking the fastest path to ingesting all the needed data, so we could move on to the ML work. Every piece of data ingestion code used regex, and a significant number of regex were duplicated across components in various languages (with the small differences needed to accommodate the variations in dialect). We may have had 40 or 50 regex, and we were just getting started!
I looked at this situation in dismay. It was defensible as a prototype, but if we turned this into a product later, how would we manage a large regex collection? I wondered how anyone did this at all.
If we turned this into a product, we would settle on one regex dialect. That would help a lot. But how do we properly test an embedded DSL like regex? And how much effort would it take to maintain a large regex collection over the years and years of a product lifetime? More fundamentally, why did we have to write regex for recognizing such mundane textual patterns as IP addresses, dates, times, and the like?
Shouldn’t these regex already exist, in a curated library with unit tests and documentation? A library that, like any code library, was actively maintained?
By chance, I had earlier worked with another group at IBM who themselves had a bespoke collection of maybe 200 or so regex patterns. For a couple of decades, they stored this collection in a spreadsheet. Let that sink in – a spreadsheet was a key part of their development and deployment process!
The spreadsheet had columns for the regex pattern, a pattern name, examples of positive and negative matches, and a comment field. They could export the spreadsheet to a CSV file, which could then be read by code (Java, if I recall) that needed the regex. The spreadsheet also stored a history of updates to the regex, in a manual form of revision tracking that preserved key information about why and when regex patterns were modified.
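To make that workflow concrete, here is a minimal Python sketch of the consuming side: it reads a CSV export and checks every pattern against its own positive and negative examples. The column names, the "|" separator between examples, and the two sample rows are assumptions invented for illustration; they are not the actual spreadsheet, and the original consumer was (as far as I recall) Java, not Python.

```python
import csv
import io
import re

# Hypothetical CSV export. Column names, the "|" example separator, and
# both rows are made up for this sketch.
CSV_EXPORT = r'''name,pattern,accepts,rejects,comment
ipv4ish,"^\d{1,3}(\.\d{1,3}){3}$",1.2.3.4|10.0.0.255,1.2.3|a.b.c.d,Shape check only
iso_date,^\d{4}-\d{2}-\d{2}$,2015-06-01,06/01/2015|2015-6-1,Shape check only
'''

def load_patterns(text):
    """Compile each row's regex, keyed by pattern name."""
    rows = csv.DictReader(io.StringIO(text))
    return {row["name"]: (re.compile(row["pattern"]), row) for row in rows}

def check_examples(patterns):
    """Return (name, example, expected) for every example that misbehaves."""
    failures = []
    for name, (regex, row) in patterns.items():
        for column, expected in (("accepts", True), ("rejects", False)):
            for example in filter(None, row[column].split("|")):
                if bool(regex.search(example)) != expected:
                    failures.append((name, example, expected))
    return failures

patterns = load_patterns(CSV_EXPORT)
failures = check_examples(patterns)
print(f"{len(patterns)} patterns loaded, {len(failures)} failing examples")
```

Even in this tiny form, the examples double as regression tests, which is the part of the spreadsheet worth keeping.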
The spreadsheet persisted for years (decades, actually), through many large changes to this system, because it served a purpose that could not easily be achieved with the software tools available over that time period.
Rosie Pattern Language
That spreadsheet of regex, which I had seen years before, was on my mind as we worked on the assortment of data ingestion components of our prototype ML system.
I had written some Lua code and liked the lpeg library, which implemented a backtracking PEG parser that was generally both space and time efficient. The elegance of lpeg had me considering Parsing Expression Grammars as a singular alternative to the myriad regex dialects with their seemingly infinite capacity to imbue a project with subtle bugs.
The lpeg library uses backtracking, needing less space than “packrat” implementations, at the cost of occasional super-linear run times (depending, critically, on the pattern being matched). While exponential backtracking is possible, it takes visible effort to make it happen, requiring a long and tortured PEG pattern. (This is very different from the situation with regex, where a short innocuous pattern can cause an exponential time match attempt.)
Importantly, in our particular use case, the pattern expressions were written by developers, not the public. Our threat model did not include anonymous injection of patterns designed to trigger super-linear matching behavior.
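For readers who have not run into the regex side of that comparison, here is a small sketch using Python's standard re module, which is a backtracking matcher. The pattern and the input sizes are my choices for illustration. The match attempt must fail, and the matcher tries many ways of splitting the run of "a"s before it gives up, so the time grows steeply as the input gets longer.

```python
import re
import time

# Nested quantifiers: short and innocent-looking, but costly when the
# overall match fails and the matcher has to backtrack.
PATTERN = re.compile(r"(a+)+$")

def time_failed_match(n):
    """Time a match attempt that must fail: n "a"s followed by one "b"."""
    subject = "a" * n + "b"
    start = time.perf_counter()
    result = PATTERN.match(subject)
    return result, time.perf_counter() - start

results = []
for n in (10, 14, 18):  # kept small so the demo finishes quickly
    result, seconds = time_failed_match(n)
    results.append(result)
    print(f"n={n:2d}  matched={result is not None}  {seconds:.4f}s")
```

Exact timings depend on the machine, so none are shown here. Increase n cautiously.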
With Lua’s small size and the capabilities of lpeg, I envisioned a small (PEG-based) library that could replace a regex library for pattern matching in text. The idea of a language-independent DSL with a small and efficient library to implement it began to take shape.
Designed for modern software development
The era of the spreadsheet as a regex repository must end. The use of sites like Stack Overflow as a source of (unmaintained) regex of questionable quality and provenance must end. So, too, the cryptic “write-only” regex syntax that is by far the most common. I saw a chance to do things differently, and sketched what became Rosie Pattern Language (RPL).
RPL is fairly readable, with the ability to give names to patterns and reuse them to build larger patterns (which is hard to do effectively with regex). Comments in RPL files let the developer cite their sources and describe their rationale. Automated, integrated unit tests provide ready-made regression tests, as well as a form of documentation showing what is intended to be matched and not matched.
The overall approach is meant to work well with modern software development processes and tools, which did not exist when regex were first used by programmers in the early 1970s. RPL files look like code and they diff like code. Rosie compiles and tests patterns outside of their use in programs (where they can be used by name). Problems that would be run-time errors for regex, such as compilation issues or regressions, are caught by Rosie at compile time.
Rosie also supports several techniques for debugging patterns, including a trace capability that produces output similar to a backtrace. Modern software development should not rely on the availability and accuracy of some random “regex help” website for debugging!
Transition
We (IBM) released Rosie as open source under the MIT license, and we were starting to build an external community of users when I left IBM to return to academia in 2018. I joined the Computer Science faculty at NC State University that year, beginning my second stint as a professor after a 19-year industry sabbatical. 😎
Tools like Rosie/RPL benefit from a network effect in which a growing community of users makes the project more attractive for new users. The technology itself is designed to facilitate sharing of patterns. But evangelism for the project slowed while I focused on my new duties as a professor.
With a growing, if small, user base, it’s now time to renew our focus on adoption of RPL and use of the Rosie implementation. I believe that we have most of what is needed to replace many regex, particularly for production code but also for one-off scripts.
We can do some things to increase usability and adoption of Rosie/RPL, and to make the learning curve smoother for regex users.
And that’s why I want to share some thoughts for Rosie version 1.3 (to finish this post) and beyond (in a future post on ideas for a “Rosie 2.0”).
Version 1.3
There are some things we really like about writing RPL instead of regex, and using Rosie to do the matching. That list includes:
- Module system (name spaces for patterns)
- Integrated executable unit tests for patterns
- Compiler/runtime project structure, with a tiny byte-code vm for matching
- Runtime (vm) knows only bytes, nothing about string encodings
- Similar syntax to regex for similar concepts
Syntax choices are hard to make. The requirement to surround a pattern with curly braces, like {"a" [0-9]} when you want to match strings like “a5”, trips up every RPL user from time to time. Without curly braces, you get the default interpretation of a pattern, which is that each part is its own token. In other words, "a" [0-9] matches “a 3” but not “a3”.
Likewise, "search" net.any matches the word “search” followed by whitespace (or a word boundary) and then a network pattern, i.e. it will match “search example.com”.
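If you think in regex, a rough analogy may help. The Python sketch below imitates the two interpretations of "a" [0-9] using the standard re module. The BOUNDARY expression is only my stand-in (whitespace, or a regex word boundary) and is an assumption for illustration; Rosie's actual boundary pattern is defined differently, so treat this as a mental model and not a translation.

```python
import re

# Stand-in for the boundary that RPL places between the parts of an
# unbraced sequence. Assumed for illustration; not Rosie's definition.
BOUNDARY = r"(?:\s+|\b)"

# Analogue of the RPL pattern  "a" [0-9]  (each part is its own token)
tokenized = re.compile(r"a" + BOUNDARY + r"[0-9]")

# Analogue of the RPL pattern  {"a" [0-9]}  (no boundary between parts)
raw = re.compile(r"a[0-9]")

for text in ("a 3", "a3", "a5"):
    print(repr(text),
          "tokenized:", bool(tokenized.fullmatch(text)),
          "raw:", bool(raw.fullmatch(text)))
```

In the analogy, the tokenized form accepts “a 3” and rejects “a3”, while the braced form does the opposite.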
We have stuck with these syntax choices, the good ones and the bad ones. We won’t introduce syntax changes any time soon. (And if we do, someday, we plan to provide an automated upgrader for old patterns.)
So you can see that we have a list of decisions that we might reconsider someday. All projects do. The need for curly braces is on that list. But that’s not for the short term. In the short term, we have a plan for version 1.3 that has two parts: better language bindings, and enhanced Unicode case support.
Language bindings
Some of the repositories in the Rosie Community site contain language bindings, also known as “clients” for the Rosie library, librosie. (If you have rosie installed, you have librosie as well.) The Python client has users, probably because the other clients are less fully developed. Also, you can pip install rosie to get the Python interface to librosie. (Be sure to install rosie as well!)
A goal for 2021 is to improve these clients, to make it easier for people to use RPL within their programs, just as it’s easy to use a regex library today.
Unicode case sensitivity
Today, Rosie supports case insensitive matching only for the ASCII letters. We have implemented support for full Unicode case insensitive matching, in all Unicode scripts, but it is not yet released.
A goal for 2021 is to integrate these Unicode enhancements into the main branch, and release them as part of Rosie version 1.3.
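To show what the difference means in practice, here is a small Python sketch that compares strings two ways: folding only the ASCII letters, and using Python's built-in Unicode case folding (str.casefold). This illustrates the general concept only. It does not use Rosie, and which foldings the unreleased Rosie implementation applies (for example, whether “ß” is treated as equal to “ss”) is not something this sketch can tell you.

```python
def ascii_fold(text):
    """Lowercase only the ASCII letters A-Z, leaving everything else alone."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)

def equal_ascii_insensitive(a, b):
    """Case-insensitive comparison that knows only about ASCII letters."""
    return ascii_fold(a) == ascii_fold(b)

def equal_unicode_insensitive(a, b):
    """Case-insensitive comparison using Python's Unicode case folding."""
    return a.casefold() == b.casefold()

pairs = [
    ("rosie", "ROSIE"),
    ("\u00e9cole", "\u00c9COLE"),   # "école" vs "ÉCOLE"
    ("stra\u00dfe", "STRASSE"),     # "straße" vs "STRASSE"
]

for a, b in pairs:
    print(a, b,
          "ascii:", equal_ascii_insensitive(a, b),
          "unicode:", equal_unicode_insensitive(a, b))
```

Only the first pair compares equal under ASCII-only folding; all three compare equal under Python's Unicode case folding.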
Conclusion
We are hard at work on Rosie version 1.3, and in parallel we are working on the foundation of “Rosie 2.0”, which will be the subject of my next post.
A number of talented NCSU students have been contributing to the project in a variety of areas, from RPL compiler optimizations to Unicode support.
We also have some projects that use Rosie/RPL as a key component. One such project aims at cleansing CSV data files so that they can be safely loaded into spreadsheets like Excel and Google Sheets. Did you know that spreadsheets regularly corrupt scientific data, such as by misinterpreting gene notation and silently casting it to a date?
In what must be the most bizarre workaround in history, scientists are renaming human genes to accommodate Excel.
Here’s the kicker: It turns out there are a variety of automatic silent “type casts” corrupting countless data sets – all because spreadsheets are a convenient tool for scientists, economists, and others to browse data. Having one tool to fix them all would be great, but it would have to be easily extensible to new use cases. We can use RPL patterns to detect the data that Excel will corrupt, so we can apply appropriate mitigations. And users will be able to add their own patterns.
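As a taste of the detection step, here is a rough Python sketch that flags cells which look like a month name followed by a day number, the shape that gets gene symbols turned into dates. It uses a plain regex as a stand-in; it is not the project's RPL, and the list of month-like prefixes and the sample row are assumptions chosen for illustration. A real pattern collection would need to cover many more cases.

```python
import re

# Assumed month-like prefixes, for illustration only.
MONTH_LIKE = r"(?:JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC)"

# A month-like prefix, an optional hyphen, then one or two digits.
DATE_LIKE_SYMBOL = re.compile(rf"{MONTH_LIKE}-?\d{{1,2}}", re.IGNORECASE)

def risky_cells(row):
    """Return the cells in a row that a spreadsheet might coerce to a date."""
    return [cell for cell in row if DATE_LIKE_SYMBOL.fullmatch(cell.strip())]

row = ["SEPT2", "MARCH1", "TP53", "DEC1", "BRCA1"]
flagged = risky_cells(row)
print(flagged)
```

Once a cell is flagged, a cleansing tool can apply whatever mitigation fits the target spreadsheet.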
But that’s a story for another day. Thanks for reading!
We welcome feedback and contributions. Please open issues (or merge requests) on GitLab, or get in touch by email.
Edit August 15, 2023: You can find my contact information, including Mastodon and LinkedIn coordinates, on my personal blog. The mailing list https://groups.io/g/rosiepattern has fallen out of use since we mostly use Slack, but perhaps it will be revived.