Rosie Pattern Language

Modern text pattern matching to replace regex

Destructuring data

Last updated on 30 Jul 2018

Is your data structured for humans, not for easy processing? Do you have data elements like CSC316 from which you want to extract the department (CSC) and the course number (316)? Perhaps you also have geo-coordinates like (35.7692755,-78.6786137), or lists of items that are usually separated by commas but sometimes by semi-colons. A single Rosie pattern can destructure all of these and more.

The destructure package

In the Rosie community group on GitLab, we have started a repository for working with “raw data”. The first contribution is the destructure package, which originated in the Pixiedust Rosie project. One parent of that project, Pixiedust, is a very cool productivity tool for notebooks. Pixiedust makes it very easy to explore, visualize, and manipulate data.

The addition of Rosie Pattern Language gives Pixiedust more capabilities, such as automatically destructuring data – that is, recognizing when a column contains entries like (35.7692755,-78.6786137) and offering to break up such a column into two new “synthetic” columns, one with the first coordinate and one with the second. Similarly, when a column contains alphanumeric codes like MAE214, Pixiedust+Rosie will offer to split those codes into their alpha and numeric parts, each in its own column.
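To make the idea of “synthetic” columns concrete, here is a minimal sketch in plain Python. It does not use Rosie or Pixiedust at all: the regular expressions are illustrative stand-ins, and the helper names split_code and split_coordinates are hypothetical, not part of either project.

```python
import re

# Illustrative stand-ins only: these regexes are NOT the Rosie patterns
# from destructure.rpl, and the helper names below are hypothetical.
CODE = re.compile(r"^([A-Za-z]+)(\d+)$")
COORD = re.compile(r"^\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)$")


def split_code(entry):
    """Return (alpha, numeric) for an entry like 'MAE214', else None."""
    m = CODE.match(entry)
    return (m.group(1), m.group(2)) if m else None


def split_coordinates(entry):
    """Return (first, second) as floats for an entry like '(1.5,-2.0)', else None."""
    m = COORD.match(entry)
    return (float(m.group(1)), float(m.group(2))) if m else None


# Build two "synthetic" columns from one column of alphanumeric codes.
column = ["MAE214", "CSC316"]
pairs = [split_code(entry) for entry in column]
alpha_column = [alpha for alpha, _ in pairs]
num_column = [num for _, num in pairs]
```

Each helper returns None when an entry does not have the expected shape, so a caller can decide whether a whole column qualifies before offering to split it.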

The Rosie package destructure.rpl has patterns for recognizing a variety of structured data. And the pattern destructure.tryall does what it says: it tries all the various destructuring patterns.
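The “try everything” idea can be sketched in plain Python as an ordered list of alternatives. This is only an analogy: it uses the standard re module rather than Rosie, the pattern table and the try_all function are hypothetical, and the real destructure.tryall may be organized quite differently.

```python
import re

# Hypothetical pattern table, tried in order. These regexes are stand-ins,
# not the contents of destructure.rpl. Coordinates come first because a
# coordinate pair also contains a comma and would otherwise look like a list.
PATTERNS = [
    ("coordinates",
     re.compile(r"^\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)$")),
    ("code", re.compile(r"^([A-Za-z]+)(\d+)$")),
]
LIST_SEPARATOR = re.compile(r"\s*[,;]\s*")


def try_all(entry):
    """Return (kind, parts) for the first alternative that matches, else None."""
    text = entry.strip()
    for name, pattern in PATTERNS:
        m = pattern.match(text)
        if m:
            return name, list(m.groups())
    # Fall back to treating the entry as a comma- or semi-colon-separated list.
    parts = LIST_SEPARATOR.split(text)
    if len(parts) > 1:
        return "list", parts
    return None
```

For example, try_all("CSC316") yields ("code", ["CSC", "316"]), while try_all("apples; pears, plums") yields ("list", ["apples", "pears", "plums"]).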

Here’s an example of destructure.tryall at work:

[Screenshot: A single pattern named destructure.tryall is used with the rosie grep command to match lines in a sample file. The output is set to color, which is used to indicate the following: (1) lists of items separated by commas or semi-colons are parsed into their constituent pieces; (2) alphanumeric codes like CSC316 are recognized as such, and the alpha part is shown in one color while the numeric part is shown in another color to demonstrate that Rosie parsed it correctly; (3) some of the sample input is enclosed in braces or parentheses.]

You can see by the color output that Rosie recognized all of the structured patterns in the input: the lists that use semi-colons, commas, and dashes between items; the lists in parentheses or braces; and the items that have alphanumeric structure. The latter are displayed with the alpha part in blue and the numeric part in cyan.

The colors and libpath settings in my ~/.rosierc file tell Rosie what colors to use and where to find the destructure library. From my libpath settings, you can see that I have cloned two community repositories, lang and rawdata, into my community directory.

-- ~/.rosierc

libpath = "/usr/local/lib/rosie/rpl"
libpath = "/Users/jennings/Projects/community/lang"
libpath = "/Users/jennings/Projects/community/rawdata"

colors="destructure.find.<search>=red:destructure.alpha=blue:destructure.num=cyan"

Contribute patterns, code, or questions

The Rosie Community group on gitlab.com was created for contributions of patterns and tools. There are just a few repositories there now, but we expect this group to grow.

Feedback

Edit to contact information, August 15, 2023.

We welcome feedback and contributions. Please open issues (or merge requests) on GitLab, or get in touch by email.

You can find my contact information, including Mastodon and LinkedIn coordinates, on my personal blog. The mailing list https://groups.io/g/rosiepattern has fallen out of use since we mostly use Slack, but perhaps it will be revived.