Destructuring data
Is your data structured for humans, not for easy processing? Do you have data
elements like CSC316 from which you want to extract the department (CSC) and
the course number (316)? Do you also have geo-coordinates like
(35.7692755,-78.6786137)? And then there are lists of items, usually separated
by commas but sometimes by semicolons. A single Rosie pattern can destructure
all of these and more.
The destructure package
In the Rosie community group on GitLab, we have started a repository for
working with “raw data”. The first contribution is the destructure package,
which originated in the Pixiedust Rosie project. One parent of that project,
Pixiedust, is a very cool productivity tool for notebooks. Pixiedust makes it
very easy to explore, visualize, and manipulate data.
The addition of Rosie Pattern Language gives Pixiedust more capabilities, such
as automatically destructuring data – that is, recognizing when a column
contains entries like (35.7692755,-78.6786137) and offering to break up such a
column into two new “synthetic” columns, one with the first coordinate and one
with the second. Similarly, when a column contains alphanumeric codes like
MAE214, Pixiedust+Rosie will offer to split those codes into their alpha and
numeric parts, each in its own column.
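To make the idea of synthetic columns concrete, here is a minimal sketch in plain Python. It uses the standard `re` module as a stand-in and is not how Pixiedust or Rosie implement the feature; the regular expressions and the `split_column` helper are hypothetical, written only for illustration.

```python
import re

# Hypothetical patterns for illustration; not the ones in destructure.rpl.
CODE = re.compile(r"([A-Za-z]+)(\d+)")
COORDS = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")

def split_column(values, pattern):
    """Return one synthetic column per capture group.

    Returns None if any value fails to match, since the column
    is then not uniformly structured.
    """
    rows = []
    for value in values:
        m = pattern.fullmatch(value)
        if m is None:
            return None
        rows.append(m.groups())
    # Transpose the rows of parts into columns of parts.
    return [list(col) for col in zip(*rows)]

courses = split_column(["CSC316", "MAE214"], CODE)
points = split_column(["(35.7692755,-78.6786137)"], COORDS)
```

Here `courses` holds an alpha column and a numeric column, and `points` holds one column per coordinate. A column with even one non-matching entry is left alone, which is one plausible policy among several.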
The Rosie package destructure.rpl has patterns for recognizing a variety of
structured data. And the pattern destructure.tryall does what it says: it tries
all the various destructuring patterns.
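The "try them all" approach can be sketched in plain Python as a list of alternatives attempted one after another. This is only an analogy built on the standard `re` module, not the RPL definition of destructure.tryall. The pattern names, the ordering, and the first-match-wins rule are assumptions made for the sketch, and it covers only comma- and semicolon-separated lists.

```python
import re

# Hypothetical (name, regex) pairs, tried in order; not taken from destructure.rpl.
NUM = r"-?\d+(?:\.\d+)?"
PATTERNS = [
    ("coordinates", re.compile(rf"\(\s*({NUM})\s*,\s*({NUM})\s*\)")),
    ("alphanumeric", re.compile(r"([A-Za-z]+)(\d+)")),
]
# Lists are usually separated by commas, but sometimes by semicolons.
LIST_SEPARATOR = re.compile(r"\s*[;,]\s*")

def try_all(text):
    """Return (pattern_name, parts) for the first alternative that matches, else None."""
    text = text.strip()
    for name, pattern in PATTERNS:
        m = pattern.fullmatch(text)
        if m:
            return name, list(m.groups())
    # Fall back to treating the input as a delimited list.
    items = LIST_SEPARATOR.split(text)
    if len(items) > 1:
        return "list", items
    return None
```

For example, `try_all("CSC316")` yields the alpha and numeric parts, while `try_all("apple; banana; cherry")` falls through to the list case. Order matters in a scheme like this: the coordinate pattern must be tried before the list fallback, or a coordinate pair would be split on its comma.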
Here’s an example of destructure.tryall at work:
You can see by the color output that Rosie recognized all of the structured patterns in the input: the lists that use semi-colons, commas, and dashes between items; the lists in parentheses or braces; and the items that have alphanumeric structure. The latter are displayed with the alpha part in blue and the numeric part in cyan.
The libpath and colors settings in my ~/.rosierc file tell Rosie where to find
the destructure library and what colors to use. From my libpath settings, you
can see that I have cloned two community repositories, lang and rawdata, into
my community directory.
-- ~/.rosierc
libpath = "/usr/local/lib/rosie/rpl"
libpath = "/Users/jennings/Projects/community/lang"
libpath = "/Users/jennings/Projects/community/rawdata"
colors="destructure.find.<search>=red:destructure.alpha=blue:destructure.num=cyan"
Contribute patterns, code, or questions
The Rosie Community group on gitlab.com was created for contributions of patterns and tools. There are just a few repositories there now, but we expect this group to grow.
Feedback
Edit to contact information, August 15, 2023.
We welcome feedback and contributions. Please open issues (or merge requests) on GitLab, or get in touch by email.
You can find my contact information, including Mastodon and LinkedIn coordinates, on my personal blog. The mailing list https://groups.io/g/rosiepattern has fallen out of use since we mostly use Slack, but perhaps it will be revived.