Unicode predicates in RPL

Last updated on 1 Apr 2018

Regex syntax has been extended over the years to allow matching of characters based on their Unicode properties. While the syntax and behavior vary considerably across implementations, the Perl syntax may be familiar.

\p{Lu} matches one codepoint in the Uppercase Letter category
\P{script=greek} matches one codepoint that is not in the Greek script

(If you are not familiar with Unicode, it will suffice for our purpose here to treat codepoint as a synonym for character.)

RPL already allows names to be bound to patterns, so there is no need to copy the Perl syntax, or the syntax of any other regex implementation.

Unicode Predicates are in the Rosie Standard Library

The RPL standard library contains a set of files in the directory rpl/Unicode, each of which defines a set of patterns corresponding to some Unicode property. For example, Script.rpl defines Script.Greek, a pattern that matches a single codepoint in the Greek script.

The RPL equivalents of the two Perl expressions above are:

Category.Lu matches one codepoint in the Uppercase Letter category
{!Script.Greek .} matches one codepoint that is not in the Greek script

To match a codepoint that is not in the Greek script, we use the RPL operator !, which means not looking at, and the pattern . which means any character. Maybe we should add an operator (or macro) to RPL to shorten this expression?
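For readers who know regex better than RPL, the same "not looking at, then any character" idea can be sketched with a negative lookahead in Python's re module. This is only an analogy, not how Rosie works. Python's standard library has no script property, so the Greek script is approximated here by two code point ranges, and the names GREEK and NOT_GREEK are made up for illustration.

```python
import re

# Assumption: approximate the Greek script by two Unicode blocks,
# Greek and Coptic (U+0370..U+03FF) and Greek Extended (U+1F00..U+1FFF).
# The real Script=Greek property differs slightly from these blocks.
GREEK = "\u0370-\u03FF\u1F00-\u1FFF"

# Regex analog of the RPL pattern {!Script.Greek .}:
# a negative lookahead ("not looking at Greek") followed by any character.
NOT_GREEK = re.compile(f"(?![{GREEK}]).", re.DOTALL)

print(NOT_GREEK.findall("a\u03b2c"))   # ['a', 'c']  (\u03b2 is Greek beta)
```

The lookahead consumes no input, so the trailing dot is what actually matches the one non-Greek character.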

Then again, a user can define their own shorthands as they wish. In the rest of the examples, the input file sample.txt contains:

According to Google Translate, “Hello, world!” in Greek is “Γειά σου Κόσμε!”. The same phrase in Hebrew is “שלום עולם!”.

Here’s an example of defining a shorthand, NG, for non-Greek words:

$ rosie --rpl 'import Unicode/Script; NG = {!Script.Greek .}+' --colors '*=red' -o color grep NG sample.txt
According to Google Translate, "Hello, world!" in Greek is "Γειά σου Κόσμε!".
The same phrase in Hebrew is "שלום עולם!".
$

Definitions like NG in the example can be saved to a file that can be loaded automatically by your ~/.rosierc file. Similarly, frequently used libraries can be imported automatically in the same way.

Also, note that the Rosie CLI auto-imports packages from the library path. This is how patterns like net.ip may be used without needing an import statement. Currently, in the 1.0.0-beta release, the auto-import search looks only one directory deep, so it will not find Script.rpl because that file is in the subdirectory Unicode. Perhaps we will enhance the CLI to search into subdirectories? In that case, what should happen when two subdirectories have packages with the same name? Let us know what you think, on the Rosie subreddit.

Examples

In the transcript below are some examples of using Unicode predicates, the Rosie match and grep commands, and custom color output.

$ # Rosie's "grep" command outputs plain text by default
$ rosie --rpl 'import Unicode/Script' grep Script.Greek sample.txt
According to Google Translate, "Hello, world!" in Greek is "Γειά σου Κόσμε!".
$ # We can change the output to color, to see the Greek characters in bold (the default)
$ rosie --rpl 'import Unicode/Script' -o color grep Script.Greek sample.txt
According to Google Translate, "Hello, world!" in Greek is "Γειά σου Κόσμε!".
$ # And we can customize the color
$ rosie --rpl 'import Unicode/Script' -o color --colors 'Script.Greek=cyan;bold' grep Script.Greek sample.txt
According to Google Translate, "Hello, world!" in Greek is "Γειά σου Κόσμε!".
$ 
$ # Of course, "grep" is shorthand for using "match" with the "findall" macro,
$ # so our last command is equivalent to this one:
$ rosie --rpl 'import Unicode/Script' --colors 'Script.Greek=cyan;bold' match findall:Script.Greek sample.txt
According to Google Translate, "Hello, world!" in Greek is "Γειά σου Κόσμε!".
$ 
$ # Let's define pattern names (and colors) for Greek words and Hebrew words
$ rosie --rpl 'import Unicode/Script; H=Script.Hebrew+; G=Script.Greek+' --colors 'H=blue;underline:G=cyan;bold' match 'findall:(H/G)' sample.txt
According to Google Translate, "Hello, world!" in Greek is "Γειά σου Κόσμε!".
The same phrase in Hebrew is "שלום עולם!".
$ 
$ # We can extract only the Hebrew and Greek words with the "subs" output format
$ rosie --rpl 'import Unicode/Script; H=Script.Hebrew+; G=Script.Greek+' -o subs match 'findall:(H/G)' sample.txt
Γειά 
σου 
Κόσμε
שלום 
עולם
$ # The "grep" command is a little simpler:
$ rosie --rpl 'import Unicode/Script; H=Script.Hebrew+; G=Script.Greek+' -o subs grep 'H/G' sample.txt
Γειά
σου
Κόσμε
שלום
עולם
$
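To make the "find every run of Hebrew or Greek characters" idea concrete outside of Rosie, here is a rough Python sketch. It is an approximation under stated assumptions: scripts are stood in for by Unicode block ranges (Python's standard library has no script property), and the function name greek_and_hebrew_words and the shortened sample string are hypothetical.

```python
import re
import unicodedata

# Assumption: scripts are approximated by Unicode blocks, which is close
# to, but not the same as, the Script property that Rosie uses.
GREEK = "\u0370-\u03FF\u1F00-\u1FFF"   # Greek and Coptic, Greek Extended
HEBREW = "\u0590-\u05FF"               # Hebrew block

# Rough analog of grep 'H/G': every run of Hebrew or of Greek characters.
WORDS = re.compile(f"[{HEBREW}]+|[{GREEK}]+")

def greek_and_hebrew_words(text):
    # Normalize to NFC so an accented letter is one precomposed code point.
    return WORDS.findall(unicodedata.normalize("NFC", text))

sample = "“Hello, world!” in Greek is “Γειά σου Κόσμε!”. In Hebrew: “שלום עולם!”."
for word in greek_and_hebrew_words(sample):
    print(word)
```

Like the "subs" output above, this prints one matched word per line and drops the surrounding text.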

Here’s another example, in which we match all the uppercase letters in the sample input:

$ rosie --rpl 'import Unicode/Category' --colors '*=red' match 'findall:Category.Lu' sample.txt
According to Google Translate, "Hello, world!" in Greek is "Γειά σου Κόσμε!".
The same phrase in Hebrew is "שלום עולם!".
$
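As a cross-check on what "in the Uppercase Letter category" means, the same selection can be sketched with Python's unicodedata module, which exposes the General_Category of each code point. This is an independent illustration, not part of Rosie, and the function name uppercase_letters is hypothetical.

```python
import unicodedata

def uppercase_letters(text):
    """Return every code point whose General_Category is Lu."""
    return [ch for ch in text if unicodedata.category(ch) == "Lu"]

line = "According to Google Translate, “Hello, world!” in Greek is “Γειά σου Κόσμε!”."
print(uppercase_letters(line))
# ['A', 'G', 'T', 'H', 'G', 'Γ', 'Κ']
```

Hebrew has no case distinction, so its letters fall in the Lo (Other Letter) category and would not be selected.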

Using Unicode Predicates Inside RPL Character Classes

RPL character class syntax resembles regex character class syntax, but is more, well, regular. A simple character class is enclosed in brackets and is exactly one of:

a range, like [A-Z], or its complement [^A-Z]
a set, like [abc], or its complement [^abc]
a Posix named set, like [:digit:], or its complement [:^digit:]

Compound character sets in RPL are made by enclosing simple character sets in brackets, such as:

[[A-Z] [abc]] which matches A through Z as well as a, b, or c
[^ [A-Z] [:digit:] ] which matches a character that is not in A-Z and not a digit
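One way to think about the two compound sets above is as predicates on a single character, combined by union and complement. The Python sketch below models that reading. It is a mental model only, says nothing about how Rosie compiles character sets, and all helper names in it are hypothetical.

```python
# Each "simple set" is modelled as a predicate on a single character.
def char_range(lo, hi):
    return lambda ch: lo <= ch <= hi

def char_set(chars):
    return lambda ch: ch in chars

def union(*preds):
    return lambda ch: any(p(ch) for p in preds)

def complement(pred):
    return lambda ch: not pred(ch)

ascii_digit = char_range("0", "9")   # stands in for [:digit:]

# [[A-Z] [abc]]
upper_or_abc = union(char_range("A", "Z"), char_set("abc"))
# [^ [A-Z] [:digit:] ]
not_upper_not_digit = complement(union(char_range("A", "Z"), ascii_digit))

print([upper_or_abc(c) for c in "Qbz"])          # [True, True, False]
print([not_upper_not_digit(c) for c in "Q7z"])   # [False, False, True]
```

The complement applies to the union as a whole, which is why the second set rejects both uppercase letters and digits.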

Our earlier post on [RPL Character Sets]({% post_url 2018-01-08-RPL-Character-Sets %}) describes the design, rationale, and capabilities in more detail. An important decision was to allow RPL pattern names inside a compound character set.

A compound character set is defined by brackets enclosing one or more simple sets, so we must include a simple character set, even an empty one, if we want to use an RPL pattern name inside a character set. For example:

$ rosie --rpl 'import Unicode/Script' grep -o color 'Script.Greek' sample.txt
According to Google Translate, "Hello, world!" in Greek is "Γειά σου Κόσμε!".
$ 
$ rosie --rpl 'import Unicode/Script' grep -o color '{!Script.Greek .}' sample.txt
According to Google Translate, "Hello, world!" in Greek is "Γειά σου Κόσμε!".
The same phrase in Hebrew is "שלום עולם!".
$ 
$ rosie --rpl 'import Unicode/Script' grep -o color '[^[] Script.Greek]' sample.txt
According to Google Translate, "Hello, world!" in Greek is "Γειά σου Κόσμε!".
The same phrase in Hebrew is "שלום עולם!".
$ 
$ rosie --rpl 'import Unicode/Script' grep -o color --colors '*=green' '[^[] Script.Greek]' sample.txt
According to Google Translate, "Hello, world!" in Greek is "Γειά σου Κόσμε!".
The same phrase in Hebrew is "שלום עולם!".
$

Limitations

Some Unicode properties are not yet supported by RPL, such as East Asian Width. We will add the remaining Unicode properties to the Rosie Standard Library.

Today, the Unicode Derived Properties are not included in the RPL library. These are property names like Lowercase, which is defined as the union of Ll and Other_Lowercase. Similarly, the Unicode specification defines the Category L and its alias Letter to be the disjunction Lu | Ll | Lt | Lm | Lo.
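The disjunction that defines L can be illustrated with a short Python sketch built on the unicodedata module. It only demonstrates the definition quoted above and is not a Rosie feature. The names LETTER_CATEGORIES and is_letter are hypothetical.

```python
import unicodedata

# The five General_Category values that make up L (Letter).
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}

def is_letter(ch):
    """True if ch is in Lu | Ll | Lt | Lm | Lo."""
    return unicodedata.category(ch) in LETTER_CATEGORIES

# "A" is Lu, "a" is Ll, U+05E9 (Hebrew shin) is Lo, "1" is Nd, "!" is Po.
print([is_letter(c) for c in "Aa\u05e91!"])
# [True, True, True, False, False]
```

A derived property like Lowercase cannot be reproduced this way from categories alone, because Other_Lowercase is a separate property in the Unicode Character Database.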

Property aliases are not yet supported. For example, today you write Category.Lu for upper case letters, but in the future, when Unicode aliases are supported, you will be able to write Category.Uppercase_Letter instead, if you wish.

Additions

The Posix character sets are defined in terms of the ASCII character set (and encoding). In RPL, as in regex, they are written like [:alpha:]. The same names are defined in the RPL library Unicode/Ascii (in the file rpl/Unicode/Ascii.rpl).

$ rosie --rpl 'import Unicode/Ascii' --colors '*=red' match 'findall:Ascii.alpha' sample.txt
According to Google Translate, "Hello, world!" in Greek is "Γειά σου Κόσμε!".
The same phrase in Hebrew is "שלום עולם!".
$

Goals

I’ll conclude this post with a brief summary of the goals we had in designing our Unicode predicate support.

Support all Unicode properties and property values that are useful for matching.
Support the latest and future Unicode standards (modulo structural changes to the standard) by automating the generation of RPL patterns from the Unicode Character Database.
Allow Unicode and other named patterns to be used within a compound character set.
Support union, intersection, and difference operations between character sets. See [RPL Character Sets]({% post_url 2018-01-08-RPL-Character-Sets %}).
Stay consistent with the RPL design, avoiding cryptic syntax as much as possible, and also avoiding ambiguity, both perceived and actual.
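To give a feel for what automating pattern generation from the Unicode Character Database might involve, here is a minimal Python sketch that collapses all code points in one category into inclusive ranges. It is not the generator used for the Rosie library. It reads the Unicode data bundled with Python, emits plain ranges rather than RPL, and the function name ranges_for_category is hypothetical.

```python
import sys
import unicodedata

def ranges_for_category(category, limit=sys.maxunicode + 1):
    """Collapse all code points below `limit` with the given
    General_Category into inclusive (first, last) ranges."""
    ranges = []
    start = None
    for cp in range(limit):
        if unicodedata.category(chr(cp)) == category:
            if start is None:
                start = cp            # a new run begins here
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:             # close a run that reaches the limit
        ranges.append((start, limit - 1))
    return ranges

for first, last in ranges_for_category("Lu", limit=0x100):
    print(f"U+{first:04X}..U+{last:04X}")
# U+0041..U+005A
# U+00C0..U+00D6
# U+00D8..U+00DE
```

Because the ranges are computed from the database rather than written by hand, rerunning such a script against a newer Unicode version would pick up new characters without manual edits.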

RPL will never be as concise as regex. Indeed, a major objective in designing RPL was to make pattern expressions easier to understand. Having full names and allowing whitespace helps that goal, while requiring a little more typing.

At the command line, we do some things to simplify using Rosie, like auto-import and reading a ~/.rosierc file. But we can do more. Auto-importing can search more extensively, for instance.

And we could supply in the standard library a set of short names for patterns that have long names. Then again, users can do this for themselves, and we hope they will, and that they will post their personal libraries of abbreviations!

Feedback

Edit to contact information, August 15, 2023.

We welcome feedback and contributions. Please open issues (or merge requests) on GitLab, or get in touch by email.

You can find my contact information, including Mastodon and LinkedIn coordinates, on my personal blog. The mailing list https://groups.io/g/rosiepattern has fallen out of use since we mostly use Slack, but perhaps it will be revived.