Using Rosie from Python

Last updated on 19 Apr 2018

For some time, Rosie has had a Python module, but it was undocumented. Until now, you had to read the code to understand how to use it. In this post, we’ll look at rosie.py, which exposes the Rosie Pattern Language functionality to Python programmers.

Install with pip

The easiest way to install Rosie for use with Python is simply:

pip install rosie

Then, from within your Python program, you only need to import the rosie package to get started:

import rosie

As of this writing (April, 2018), you will get the latest beta version. The installation consists mainly of librosie, rosie.py, and the standard library of rpl patterns. If there is a wheel for your platform, the install will run quickly, simply copying a few files. If not, then pip will download the source distribution and build librosie, which requires cc/gcc and make.

This installation method puts a private copy of Rosie wherever your Python modules are stored. As a result, pip uninstall rosie will remove all traces of the installation.

If you already have Rosie installed, you will now have two independent installations. However, you can call rosie.load(path) at the start of your Python program to load the librosie that you have already installed (e.g. in /usr/local). Using rosie.load() is optional; without it, your Python code will use the Rosie installation in your Python directory.

Programming model

The programming model for using Rosie is similar to PCRE, RE2, and other pattern matching libraries. Roughly speaking, the steps are:

Create a matching engine
Optionally load some pattern definitions into the engine
Compile a pattern
Use the compiled pattern to match against input data

Here’s an example (without any error checking):

# matchall.py
from __future__ import print_function
import rosie, sys
e = rosie.engine()
e.import_pkg('all')
pat, errs = e.compile('all.things')
print(e.match(pat, sys.argv[1], 1, 'color')[0])

$ python matchall.py 'Today is April 14, 2018.  IP 1.2.3.4; 1 mole = 6.02e23; www.ibm.com/a/b/c'
Today is April 14, 2018.  IP 1.2.3.4; 1 mole = 6.02e23; www.ibm.com/a/b/c
$
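Since the example above skips error checking, here is a minimal sketch of the same steps with a check after compilation. The helper name `colorize` is ours, not part of rosie.py, and the sketch assumes that a failed compile returns `None` in place of the pattern, with the error description as the second value.

```python
def colorize(engine, package, expression, data):
    """Import a package, compile an expression, and return the colorized match."""
    # Step 1 (creating the engine) is left to the caller: engine = rosie.engine()
    engine.import_pkg(package)                    # step 2: load pattern definitions
    pattern, errs = engine.compile(expression)    # step 3: compile a pattern
    if pattern is None:
        # Assumption: a failed compile yields None plus a description of the errors.
        raise ValueError('could not compile %r: %s' % (expression, errs))
    # Step 4: match from position 1 (1-based) and keep only the first return value.
    return engine.match(pattern, data, 1, 'color')[0]
```

With Rosie installed, a call might look like `colorize(rosie.engine(), 'all', 'all.things', sys.argv[1])`.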

Engines

In the example, rosie.engine() creates a matching engine. You can create as many engines as you want or need. (They will be garbage collected when they are no longer accessible.)

An engine’s state includes a set of pattern definitions that have been loaded. When you make a new engine, it has only a limited set of built-in patterns.

Components of an engine’s state:

Name	Description	Sample value
Environment	Patterns loaded (available for use)	`$`, `num.int`, `net.ipv4`
Library path	List of directories to search when importing	`/usr/local/lib/rosie/rpl:~/rpl`
Colors	Map from pattern names to colors for colorized output	`foo=green;bold`

Examining and setting the environment

`librosie` can tell you what patterns are loaded, but this API is not yet in `rosie.py`.

New pattern definitions are added to the environment of an engine using the load and import_pkg functions (see below).

Examining and setting the library path

The method e.libpath() returns the current library path. When called with a string argument, it sets the library path.

The full configuration of an engine, including the library path, can be obtained via e.config(), which returns a JSON encoding of two lists of configuration objects. The first list holds the settings of the Rosie installation, and the second holds the settings of this particular engine.
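As a rough sketch of how these two calls might be used, the helpers below prepend a directory to the library path and decode the configuration. The helper names are hypothetical. The sketch assumes that libpath() returns a colon-separated string (as in the sample value in the table above) and that config() returns text that `json.loads()` can decode.

```python
import json

def prepend_libpath(engine, directory):
    """Put `directory` at the front of the engine's library path."""
    current = engine.libpath()        # no argument: read the current path
    # Assumption: the path is a colon-separated str, as in the sample value.
    new_path = directory + ':' + current if current else directory
    engine.libpath(new_path)          # string argument: set the path
    return new_path

def split_config(engine):
    """Return (installation_settings, engine_settings) decoded from config()."""
    installation_settings, engine_settings = json.loads(engine.config())
    return installation_settings, engine_settings
```

Because the libpath only affects subsequent imports, a helper like `prepend_libpath` would need to run before any import_pkg call that depends on the new directory.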

Loading pattern definitions

There are three ways to load pattern definitions into a Rosie matching engine:

from a string
from an arbitrary file
by importing a package

API	Argument	Description
`e.load(s)`	Python bytes	RPL statements, UTF-8 encoded in a byte array
`e.loadfile(fn)`	Filename	File of RPL statements (UTF-8)
`e.import_pkg(pn)`	Package name (e.g. `net`)	Name of a package on the `libpath`

The most common RPL statements bind expressions to a name, e.g.

int = 0 / [1-9][0-9]*

A set of statements forms a block, and a block can optionally include import statements, a package declaration, and an rpl (language version) declaration.

A block that includes a package declaration is, of course, a package. Names inside a package are referred to using the package name as a prefix, e.g. net.ipv4.

When loading RPL from strings or files, the RPL statements may or may not form a package. When importing a package, the file that implements the package must declare the package name. Rosie searches for packages in each directory on its libpath. The libpath of an engine can be changed at any time, and will affect subsequent calls to import_pkg.

Matching options

As we saw in the example at the start of the Programming Model section above, the match method of an engine takes 4 arguments:

a compiled pattern,
the input data,
a start position within the input data (1-based), and
the choice of output encoding.

A compiled pattern is returned by the engine’s compile(exp) method, where the argument, exp, is a Python bytes object encoding an RPL expression in UTF-8. In the example above, the expression is simply a reference to the pattern all.things, which was already loaded (via import_pkg).

Some of the output encoding options are:

Output encoder	Returns
bool	True if the pattern matched, and False otherwise
byte	Byte array which compactly encodes a match
color	String that may include ANSI escape sequences
json	String encoding a match structure in JSON
jsonpp	String encoding a match as pretty-printed JSON
line	The input, if the pattern matched, and False otherwise
matches	The portion of the input that matched
subs	A linefeed-separated list of the submatches

Most of the output encoders were designed for the Rosie CLI, and in particular for human consumption. The formats most applicable for Python users are likely to be:

Output encoder	Usage
json	Decode with `json.loads()` or equivalent to obtain a match structure
bool	Faster than `json` for when you want to know only if there was a match
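To illustrate the two encoders in the table above, here is a small sketch of wrappers around match(). The names `match_json` and `is_match` are ours, not part of rosie.py, and we assume that the first return value is falsy (for example `None`) when the pattern does not match.

```python
import json

def match_json(engine, pattern, data, start=1):
    """Return the match as a Python dict, or None if the pattern did not match."""
    encoded = engine.match(pattern, data, start, 'json')[0]
    if not encoded:
        # Assumption: the first return value is falsy when there is no match.
        return None
    return json.loads(encoded)

def is_match(engine, pattern, data, start=1):
    """Ask only whether the pattern matched, using the cheaper 'bool' encoder."""
    return bool(engine.match(pattern, data, start, 'bool')[0])
```

Here `engine` would come from `rosie.engine()` and `pattern` from the engine's compile() method.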

Matches encoded using the compact `byte` format are much smaller than their `json` equivalent, and faster to decode. We have not yet implemented a decoder for Python, which would accept Rosie's `byte` format and return a match data structure (a Python dictionary). The format is simple ([here is a decoder in C that produces a Lua table](https://gitlab.com/rosie-pattern-language/rosie-lpeg/blob/a84a89e50a74ac2554126aba37a202f326fdc645/src/lpcap.c#L617-L678)) and if someone wants to contribute a Python decoder, we'd appreciate it!

Return values from match()

The full set of return values from engine.match() is:

Return value	Description
match	`None` or a match structure (see below)
leftover	number of characters left unmatched (0 when entire input was matched)
abend	true when matching ended abnormally via the RPL `error` macro, false otherwise
total time	microseconds of processor time consumed by `match()` (see note below)
match time	microseconds of processor time spent in the matching vm (see note below)

Note about time values

The two time values returned by match() are a crude but useful measure of performance. The total time includes call overhead, matching, and encoding the output. The match time includes only the time spent in the matching vm, and is therefore independent of the choice of output encoder.

Important note: The last step of RPL pattern compilation is code generation for the matching vm, and it is performed the first time the pattern is used in match(). So the cost of the low-level code generation, which is done once per pattern, is reflected in the match time the first time a pattern is used. This is true for Rosie v1.0.0-beta, and may change in the future.
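Indexing into the return values by position is easy to get wrong, so one option is to give them names. The sketch below assumes that match() returns the five values as a sequence in the order listed in the table above. `MatchResult` and `unpack` are hypothetical names, not part of rosie.py.

```python
from collections import namedtuple

# Field names follow the table of return values; the order is an assumption.
MatchResult = namedtuple('MatchResult',
                         'match leftover abend total_time match_time')

def unpack(result):
    """Name the values returned by match() and compute the non-matching overhead."""
    r = MatchResult(*result)
    # Total time includes call overhead, matching, and output encoding, so the
    # difference approximates everything except time in the matching vm.
    overhead = r.total_time - r.match_time
    return r, overhead
```

Given the note above about code generation, the match time from the first use of a pattern is best left out of any timing comparison.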

The match structure

A Rosie match structure consists of the following fields:

Field name	Description
type	the RPL pattern name that matched
data	the input data that matched this pattern
s	the start position of the match, a 1-based byte index
e	the end position of the match (see note below)
subs	`None` or a list of submatches, each of which is a match structure

Note on start/end indices: The end position, e, is the first byte of the first character after the match. In other words, if input is a byte array holding the input data, then the match consists of the characters in the Python slice input[s-1:e-1]. Note the adjustment to Python’s 0-based indexing.

Note on the match data: If the input data is valid UTF-8, then both s and e will point to the first byte of a valid UTF-8 encoded character, and the data field will be valid UTF-8.
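The index adjustment can be sketched as a one-line helper. The function name is ours. The sample values are taken from the Python example in this post.

```python
def matched_bytes(input_bytes, match):
    """Return the bytes covered by a match structure.

    s is a 1-based start index and e is the 1-based index of the first byte
    after the match, so both shift down by one for a 0-based Python slice.
    """
    return input_bytes[match['s'] - 1:match['e'] - 1]

data = b'2018-04-19T06:12:34.591774'
whole = matched_bytes(data, {'s': 1, 'e': 11})   # b'2018-04-19'
month = matched_bytes(data, {'s': 6, 'e': 8})    # b'04'
```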

Python example:

>>> d, err = e.compile('date.any')
>>> json.loads(e.match(d, datetime.datetime.now().isoformat(), 1, 'json')[0])
{u'e': 11, u's': 1, u'type': u'date.any', u'subs': [{u'e': 11, u's': 1, u'type': u'date.dashed', u'subs': [{u'e': 5, u's': 1, u'type': u'date.year', u'data': u'2018'}, {u'e': 8, u's': 6, u'type': u'date.month', u'data': u'04'}, {u'e': 11, u's': 9, u'type': u'date.day', u'data': u'19'}], u'data': u'2018-04-19'}], u'data': u'2018-04-19'}
>>>
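Once decoded, a match structure is an ordinary nested dictionary, so it can be walked recursively. The sketch below (the function name `leaves` is ours) collects the innermost submatches from the structure shown above.

```python
def leaves(match):
    """Return (type, data) pairs for the innermost submatches, left to right."""
    subs = match.get('subs')          # the key may be absent or None at a leaf
    if not subs:
        return [(match['type'], match['data'])]
    found = []
    for sub in subs:
        found.extend(leaves(sub))
    return found

# The match structure from the example above, without the u'' prefixes.
m = {'type': 'date.any', 's': 1, 'e': 11, 'data': '2018-04-19', 'subs': [
        {'type': 'date.dashed', 's': 1, 'e': 11, 'data': '2018-04-19', 'subs': [
            {'type': 'date.year', 's': 1, 'e': 5, 'data': '2018'},
            {'type': 'date.month', 's': 6, 'e': 8, 'data': '04'},
            {'type': 'date.day', 's': 9, 'e': 11, 'data': '19'}]}]}

parts = leaves(m)
# [('date.year', '2018'), ('date.month', '04'), ('date.day', '19')]
```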

CLI example, for comparison:

$ date | rosie match date.any
Thu Apr 19 05:42:01 EDT 2018
$ date | rosie -o jsonpp match date.any
{"type": "date.any", 
 "e": 12, 
 "s": 1, 
 "subs": 
   [{"type": "date.us_long", 
     "e": 12, 
     "s": 1, 
     "subs": 
       [{"type": "date.day_name", 
         "e": 4, 
         "s": 1, 
         "subs": 
           [{"type": "date.day_shortname", 
             "e": 4, 
             "s": 1, 
             "data": "Thu"}], 
         "data": "Thu"}, 
        {"type": "date.month_name", 
         "e": 8, 
         "s": 5, 
         "subs": 
           [{"type": "date.month_shortname", 
             "e": 8, 
             "s": 5, 
             "data": "Apr"}], 
         "data": "Apr"}, 
        {"type": "date.day", 
         "e": 11, 
         "s": 9, 
         "data": "19"}], 
     "data": "Thu Apr 19 "}], 
 "data": "Thu Apr 19 "}
$

Debugging

Tools for debugging patterns include:

the trace command of the CLI and the .trace command of the REPL
the trace() method

The tracing capability prints out a graphical depiction of the matching process, as shown in the example below.

The trace data is available today in `full` and `condensed` formats, both of which are strings suitable for printing, i.e. for human consumption. Rosie is capable of returning a JSON-encoded tree structure containing the trace data, but the API to do this is not yet implemented.

>>> print(e.trace(d, datetime.datetime.now().isoformat(), 1, 'condensed')[1])
Expression: {us / eur / dashed / slashed / rfc2822 / rfc3339 / spaced_en / spaced}
Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
Matched 10 chars
├── Expression: us
│   Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│   No match
│   └── Expression: {us_dashed / us_slashed / us_long / us_short}
│       Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│       No match
│       ├── Expression: us_dashed
│       │   Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│       │   No match
│       │   └── Expression: {month "-" day "-" short_long_year}
│       │       Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│       │       No match
│       │       ├── Expression: month
│       │       │   Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│       │       │   Matched 1 chars
│       │       │   └── Expression: {{"1" [0-2]} / {{"0"}? [1-9]}}
│       │       │       Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│       │       │       Matched 1 chars
│       │       │       ├── Expression: {"1" [0-2]}
│       │       │       │   Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│       │       │       │   No match
│       │       │       │   ├── Expression: "1"
│       │       │       │   │   Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│       │       │       │   │   No match
│       │       │       │   └── Expression: [0-2]
│       │       │       │       Not attempted
│       │       │       └── Expression: {{"0"}? [1-9]}
│       │       │           Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│       │       │           Matched 1 chars
│       │       ├── Expression: "-"
│       │       │   Looking at: 《018-04-19T06:12:34.591774》 (input pos = 2)
│       │       │   No match
│       │       ├── Expression: day
│       │       │   Not attempted
│       │       ├── Expression: "-"
│       │       │   Not attempted
│       │       └── Expression: short_long_year
│       │           Not attempted
│       ├── Expression: us_slashed
│       │   Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│       │   No match
│       ├── Expression: us_long
│       │   Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│       │   No match
│       └── Expression: us_short
│           Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│           No match
├── Expression: eur
│   Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│   No match
├── Expression: dashed
│   Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│   Matched 10 chars
├── Expression: slashed
│   Not attempted
├── Expression: rfc2822
│   Not attempted
├── Expression: rfc3339
│   Not attempted
├── Expression: spaced_en
│   Not attempted
└── Expression: spaced
    Not attempted
>>>

Examples

A very simple example of using Rosie from Python is the program generic_sloc.py in the Rosie examples directory on GitLab.

A simpler API should wrap the current one

Today, rosie.py is low level and not very Pythonic. It is important to expose librosie functionality at a low level, so this is a good start. However, we need a layer on top of rosie.py that is easier to use.

I suspect that every Rosie + Python user has developed their own small interface layer to suit their needs. We would like to have a clean high-level interface for general use. If you are interested in contributing a layer over rosie.py, please get in touch by email (link in side menu, at left) or by opening a GitLab issue.

Discussion on reddit

A Rosie subreddit has been created for discussion of these posts and for questions about Rosie and RPL. See you there!