Using Rosie from Python
For some time, Rosie has had a Python module, but it was undocumented. Until
now, you had to read the code to understand how to use it. In this post, we’ll
look at rosie.py, which exposes the Rosie Pattern Language functionality to
Python programmers.
Install with pip
The easiest way to install Rosie for use with Python is simply:
pip install rosie
Then, from within your Python program, you only need to import the rosie package to get started:
import rosie
As of this writing (April, 2018), you will get the latest beta version.
The installation consists mainly of librosie, rosie.py, and the standard
library of rpl patterns. If there is a wheel for your platform, the install
will run quickly, simply copying a few files. If not, then pip will download
the source distribution and build librosie, which requires cc/gcc and make.
This installation method puts a private copy of Rosie wherever your Python
modules are stored. As a result, pip uninstall rosie will remove all traces
of the installation.
If you already have Rosie installed, you will now have two independent
installations. However, you can call rosie.load(path) at the start of your
Python program to load the librosie that you have already installed (e.g. in
/usr/local). Using rosie.load() is optional; without it, your Python code
will use the Rosie installation in your Python directory.
Programming model
The programming model for using Rosie is similar to PCRE, RE2, and other pattern matching libraries. Roughly speaking, the steps are:
- Create a matching engine
- Optionally load some pattern definitions into the engine
- Compile a pattern
- Use the compiled pattern to match against input data
Here’s an example (without any error checking):
# matchall.py
from __future__ import print_function
import rosie, sys
e = rosie.engine()
e.import_pkg('all')
pat, errs = e.compile('all.things')
print(e.match(pat, sys.argv[1], 1, 'color')[0])
$ python matchall.py 'Today is April 14, 2018. IP 1.2.3.4; 1 mole = 6.02e23; www.ibm.com/a/b/c'
Today is April 14, 2018. IP 1.2.3.4; 1 mole = 6.02e23; www.ibm.com/a/b/c
$
Engines
In the example, rosie.engine() creates a matching engine. You can create as
many engines as you want or need. (They will be garbage collected when they are
no longer accessible.)
An engine’s state includes a set of pattern definitions that have been loaded. When you make a new engine, it has only a limited set of built-in patterns.
Components of an engine’s state:
Name | Description | Sample value |
---|---|---|
Environment | Patterns loaded (available for use) | $, num.int, net.ipv4 |
Library path | List of directories to search when importing | /usr/local/lib/rosie/rpl:~/rpl |
Colors | Map from pattern names to colors for colorized output | foo=green;bold |
Examining and setting the environment
`librosie` can tell you what patterns are loaded, but this API is not yet in `rosie.py`.
New pattern definitions are added to the environment of an engine using the
load and import_pkg functions (see below).
Examining and setting the library path
The method e.libpath() returns the current library path. When called with a
string argument, it sets the library path.
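Because calling e.libpath() with a string argument replaces the whole path, adding one directory means reading the current value, editing it, and writing it back. Here is a minimal sketch of the string handling only. It assumes the path is a colon-separated list, as in the sample value in the table above, and that the value is available as ordinary text; prepend_dir is a hypothetical helper, not part of rosie.py.

```python
from __future__ import print_function


def prepend_dir(libpath, new_dir):
    """Return libpath with new_dir placed first, dropping any duplicate.

    Assumes libpath is a colon-separated string such as
    '/usr/local/lib/rosie/rpl:~/rpl' (an assumption based on the sample value).
    """
    dirs = [d for d in libpath.split(':') if d and d != new_dir]
    return ':'.join([new_dir] + dirs)


current = '/usr/local/lib/rosie/rpl:~/rpl'
print(prepend_dir(current, '/opt/patterns'))

# With a real engine e, the round trip might look like this (not run here):
#   e.libpath(prepend_dir(e.libpath(), '/opt/patterns'))
```

The directory /opt/patterns is only an illustration. If e.libpath() returns bytes rather than text on your Python version, decode it before calling the helper.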
The full configuration of an engine, including the library path, can be obtained
via e.config(), which returns a JSON encoding of two lists of configuration
objects. The first list holds the settings of the Rosie installation, and the
second list holds the settings of this particular engine.
Loading pattern definitions
There are three ways to load pattern definitions into a Rosie matching engine:
- from a string
- from an arbitrary file
- by importing a package
API | Argument | Description |
---|---|---|
e.load(s) | Python bytes | RPL statements, UTF-8 encoded in a byte array |
e.loadfile(fn) | Filename | File of RPL statements (UTF-8) |
e.import_pkg(pn) | Package name (e.g. net) | Name of a package on the libpath |
The most common RPL statements bind expressions to a name, e.g.
int = 0 / [1-9][0-9]*
A set of statements forms a block, and a block can optionally include import
statements, a package declaration, and an rpl (language version) declaration.
A block that includes a package declaration is, of course, a package. Names
inside a package are referred to using the package name as a prefix,
e.g. net.ipv4.
When loading RPL from strings or files, the RPL statements may or may not form a
package. When importing a package, the file that implements the package must
declare the package name. Rosie searches for packages in each directory on its
libpath. The libpath of an engine can be changed at any time, and will
affect subsequent calls to import_pkg.
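Since e.load() expects bytes, RPL kept in ordinary Python text has to be encoded first. Here is a minimal sketch of that step; to_rpl_bytes is a hypothetical helper, not part of rosie.py, and the call to e.load() appears only as a comment because it needs an engine.

```python
from __future__ import print_function


def to_rpl_bytes(rpl_text):
    """Encode RPL source as UTF-8 bytes; bytes are passed through unchanged."""
    if isinstance(rpl_text, bytes):
        return rpl_text
    return rpl_text.encode('utf-8')


rpl = to_rpl_bytes(u'int = 0 / [1-9][0-9]*')
print(repr(rpl))

# With an engine e, the definition would then be loaded with:
#   e.load(rpl)
```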
Matching options
As we saw in the example at the start of the Programming Model section above,
the match method of an engine takes four arguments:
- a compiled pattern,
- the input data,
- a start position within the input data (1-based), and
- the choice of output encoding.
A compiled pattern is returned by the engine’s compile(exp) method, where the
argument, exp, is a Python bytes object encoding an RPL expression in
UTF-8. In the example above, the expression is simply a reference to the
pattern all.things, which was already loaded (via import_pkg).
Some of the output encoding options are:
Output encoder | Returns |
---|---|
bool | True if the pattern matched, and False otherwise |
byte | Byte array which compactly encodes a match |
color | String that may include ANSI escape sequences |
json | String encoding a match structure in JSON |
jsonpp | String encoding a match as pretty-printed JSON |
line | The input, if the pattern matched, and False otherwise |
matches | The portion of the input that matched |
subs | A linefeed-separated list of the submatches |
Most of the output encoders were designed for the Rosie CLI, and in particular for human consumption. The formats most applicable for Python users are likely to be:
Output encoder | Usage |
---|---|
json | Decode with json.loads() or equivalent to obtain a match structure |
bool | Faster than json for when you want to know only if there was a match |
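Decoding the json output takes one extra step, because the first return value may be empty when nothing matched. Here is a minimal sketch, using a hand-written JSON string in place of real match() output; decode_match is a hypothetical helper, not part of rosie.py.

```python
from __future__ import print_function
import json


def decode_match(raw):
    """Turn the first return value of e.match(..., 'json') into a dict.

    Returns None when raw is empty (None or False), i.e. nothing matched.
    """
    if not raw:
        return None
    if isinstance(raw, bytes):
        raw = raw.decode('utf-8')
    return json.loads(raw)


# Hand-written stand-in for the JSON string an engine would return.
raw = '{"type": "date.year", "s": 1, "e": 5, "data": "2018"}'
m = decode_match(raw)
print(m['type'], m['data'])
print(decode_match(None))
```

With an engine e and a compiled pattern pat, the call would be decode_match(e.match(pat, data, 1, 'json')[0]).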
Return values from match()
The full set of return values from engine.match() is:
Return value | Description |
---|---|
match | None or a match structure (see below) |
leftover | number of characters left unmatched (0 when entire input was matched) |
abend | 0 when matching ended abnormally via the RPL error macro |
total time | microseconds of processor time consumed by match() (see note below) |
match time | microseconds of processor time spent in the matching vm (see note below) |
Note about time values
The two time values returned by match() are a crude but useful measure of
performance. The total time includes call overhead, matching, and encoding the
output. The match time includes only the time spent in the matching vm, and is
therefore independent of the choice of output encoder.
Important note: The last step of RPL pattern compilation is code generation
for the matching vm, and it is performed the first time the pattern is used in
match(). So the cost of the low-level code generation, which is done once per
pattern, is reflected in the match time the first time a pattern is used.
This is true for Rosie v1.0.0-beta, and may change in the future.
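When comparing timings across calls, it helps to keep the first call, which pays for code generation, apart from the rest. Here is a minimal sketch, using invented numbers in place of the (total time, match time) pairs that repeated calls to match() would return; summarize is a hypothetical helper.

```python
from __future__ import print_function


def summarize(timings):
    """Separate the first call from the steady state.

    timings is a list of (total_time, match_time) pairs in microseconds.
    Returns (first_match_time, mean_match_time, mean_overhead), where the
    means skip the first call and overhead is total time minus match time.
    """
    _, first_match = timings[0]
    rest = timings[1:]
    if not rest:
        return first_match, None, None
    mean_match = sum(m for _, m in rest) / float(len(rest))
    mean_overhead = sum(t - m for t, m in rest) / float(len(rest))
    return first_match, mean_match, mean_overhead


# Invented values: the first pair is larger because it includes code generation.
samples = [(900, 700), (120, 40), (110, 30), (130, 50)]
print(summarize(samples))
```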
The match structure
A Rosie match structure consists of the following fields:
Field name | Description |
---|---|
type | the RPL pattern name that matched |
data | the input data that matched this pattern |
s | the start position of the match, a 1-based byte index |
e | the end position of the match (see note below) |
subs | None or a list of submatches, each of which is a match structure |
Note on start/end indices: The end position, e, is the first byte of the
first character after the match. In other words, if input is a byte array
holding the input data, then the match consists of the characters in the
Python slice input[s-1:e-1]. Note the adjustment to Python’s 0-based
indexing.
Note on the match data: If the input data is valid UTF-8, then both s and
e will point to the first byte of a valid UTF-8 encoded character, and the
data field will be valid UTF-8.
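The index arithmetic can be checked with plain Python. Here is a minimal sketch that applies the slice to an ISO timestamp, using s and e values of the kind Rosie reports for a date; matched_bytes is a hypothetical helper, not part of rosie.py.

```python
from __future__ import print_function


def matched_bytes(data, s, e):
    """Return the bytes covered by a match with 1-based start s and end e."""
    return data[s - 1:e - 1]


data = b'2018-04-19T06:12:34.591774'
print(matched_bytes(data, 1, 11))   # whole date: s=1, e=11
print(matched_bytes(data, 6, 8))    # month only: s=6, e=8
```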
Python example:
>>> d, err = e.compile('date.any')
>>> json.loads(e.match(d, datetime.datetime.now().isoformat(), 1, 'json')[0])
{u'e': 11, u's': 1, u'type': u'date.any', u'subs': [{u'e': 11, u's': 1, u'type': u'date.dashed', u'subs': [{u'e': 5, u's': 1, u'type': u'date.year', u'data': u'2018'}, {u'e': 8, u's': 6, u'type': u'date.month', u'data': u'04'}, {u'e': 11, u's': 9, u'type': u'date.day', u'data': u'19'}], u'data': u'2018-04-19'}], u'data': u'2018-04-19'}
>>>
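Since every submatch is itself a match structure, a short recursive function can flatten the tree. Here is a sketch that collects the leaves of the structure printed above; leaves is a hypothetical helper, and the dictionary is copied from that output.

```python
from __future__ import print_function


def leaves(match):
    """Yield (type, data) for each match that has no submatches."""
    subs = match.get('subs')
    if not subs:
        yield match['type'], match['data']
        return
    for sub in subs:
        for leaf in leaves(sub):
            yield leaf


m = {'type': 'date.any', 's': 1, 'e': 11, 'data': '2018-04-19', 'subs': [
    {'type': 'date.dashed', 's': 1, 'e': 11, 'data': '2018-04-19', 'subs': [
        {'type': 'date.year', 's': 1, 'e': 5, 'data': '2018'},
        {'type': 'date.month', 's': 6, 'e': 8, 'data': '04'},
        {'type': 'date.day', 's': 9, 'e': 11, 'data': '19'}]}]}

print(list(leaves(m)))
```

A generator keeps the helper usable on deep trees without building intermediate lists at every level.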
CLI example, for comparison:
$ date | rosie match date.any
Thu Apr 19 05:42:01 EDT 2018
$ date | rosie -o jsonpp match date.any
{"type": "date.any", "e": 12, "s": 1, "subs": [{"type": "date.us_long", "e": 12, "s": 1, "subs": [{"type": "date.day_name", "e": 4, "s": 1, "subs": [{"type": "date.day_shortname", "e": 4, "s": 1, "data": "Thu"}], "data": "Thu"}, {"type": "date.month_name", "e": 8, "s": 5, "subs": [{"type": "date.month_shortname", "e": 8, "s": 5, "data": "Apr"}], "data": "Apr"}, {"type": "date.day", "e": 11, "s": 9, "data": "19"}], "data": "Thu Apr 19 "}], "data": "Thu Apr 19 "}
$
Debugging
Tools for debugging patterns include:
- the trace command of the CLI and the .trace command of the REPL
- the trace() method
The tracing capability prints out a graphical depiction of the matching process, as shown in the example below.
The trace data is available today in `full` and `condensed` formats, both of which are strings suitable for printing, i.e. for human consumption. Rosie is capable of returning a JSON-encoded tree structure containing the trace data, but the API to do this is not yet implemented.
>>> print e.trace(d, datetime.datetime.now().isoformat(), 1, 'condensed')[1]
Expression: {us / eur / dashed / slashed / rfc2822 / rfc3339 / spaced_en / spaced}
Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
Matched 10 chars
├── Expression: us
│   Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│   No match
│   └── Expression: {us_dashed / us_slashed / us_long / us_short}
│       Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│       No match
│       ├── Expression: us_dashed
│       │   Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│       │   No match
│       │   └── Expression: {month "-" day "-" short_long_year}
│       │       Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│       │       No match
│       │       ├── Expression: month
│       │       │   Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│       │       │   Matched 1 chars
│       │       │   └── Expression: {{"1" [0-2]} / {{"0"}? [1-9]}}
│       │       │       Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│       │       │       Matched 1 chars
│       │       │       ├── Expression: {"1" [0-2]}
│       │       │       │   Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│       │       │       │   No match
│       │       │       │   ├── Expression: "1"
│       │       │       │   │   Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│       │       │       │   │   No match
│       │       │       │   └── Expression: [0-2]
│       │       │       │       Not attempted
│       │       │       └── Expression: {{"0"}? [1-9]}
│       │       │           Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│       │       │           Matched 1 chars
│       │       ├── Expression: "-"
│       │       │   Looking at: 《018-04-19T06:12:34.591774》 (input pos = 2)
│       │       │   No match
│       │       ├── Expression: day
│       │       │   Not attempted
│       │       ├── Expression: "-"
│       │       │   Not attempted
│       │       └── Expression: short_long_year
│       │           Not attempted
│       ├── Expression: us_slashed
│       │   Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│       │   No match
│       ├── Expression: us_long
│       │   Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│       │   No match
│       └── Expression: us_short
│           Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│           No match
├── Expression: eur
│   Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│   No match
├── Expression: dashed
│   Looking at: 《2018-04-19T06:12:34.591774》 (input pos = 1)
│   Matched 10 chars
├── Expression: slashed
│   Not attempted
├── Expression: rfc2822
│   Not attempted
├── Expression: rfc3339
│   Not attempted
├── Expression: spaced_en
│   Not attempted
└── Expression: spaced
    Not attempted
>>>
Examples
A very simple example of using Rosie from Python is the program generic_sloc.py in the Rosie examples directory on Gitlab.
A simpler API should wrap the current one
Today, rosie.py is low level and not very Pythonic. It is important to expose
librosie functionality at a low level, so this is a good start. However, we
need a layer on top of rosie.py that is easier to use.
I suspect that every Rosie + Python user has developed their own small interface layer
to suit their needs. We would like to have a clean high-level interface for
general use. If you are interested in contributing a layer over rosie.py,
please get in touch by email (link in side menu, at left)
or by opening a Gitlab issue.
Discussion on reddit
A Rosie subreddit has been created for discussion of these posts and for questions about Rosie and RPL. See you there!
Follow us on Twitter for announcements about the RPL approach to #modernpatternmatching.