
Rosie Pattern Language

Modern text pattern matching to replace regex

Handling null characters

Last updated on 12 Jan 2019

I recently read this blog post about grepping for null bytes in text files.

Some versions of Unix grep (and many other tools) have inherited the unfortunate trait of treating null bytes specially, in large part because C language libraries use null to mark the end of a string.

While Rosie was developed on Unix/Linux, it was designed to handle any input, including null characters, invalid UTF-8, and indeed arbitrary byte sequences.

Rosie grep “\x00”

Our input file, hasnulls.txt, is a text file containing 144 (mostly) random bytes. Let’s use Rosie to search for the null bytes, using the RPL hex character notation \x00. On the bash command line, we use single quotes around the RPL expression we want to match, to prevent the shell from expanding or interpolating it.
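To see what the quoting buys us, here is a small sketch that uses printf as a stand-in for Rosie (so nothing here depends on Rosie being installed). It shows what the shell actually hands to a program with and without the single quotes.

```shell
# With single quotes, the RPL expression arrives untouched, double quotes included.
printf '%s\n' '"\x00"'    # prints "\x00"

# Without them, the shell treats the double quotes as its own quoting and removes them.
printf '%s\n' "\x00"      # prints \x00
```

In the second case the program would receive a bare \x00 with no quotes around it, which is no longer the RPL string literal we meant to write.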

We’ll use the -o subs output format, so that Rosie will print each match on its own line, making it easy to count the matches. (Technically, each null that Rosie finds is a sub-match of the search pattern, which is why this output format is named subs.)

Command is: rosie grep -o subs '"\x00"' hasnulls.txt

We see in the above screen shot that there are 11 null bytes in the input file. Just to be sure, let’s translate the nulls in the Rosie output into a printable character like +.

Command is: rosie grep -o subs '"\x00"' hasnulls.txt | tr "\000" "+"
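As an independent cross-check that does not involve Rosie at all, the count can also be obtained with standard tools. This is a minimal sketch: the helper name count_nulls and the tiny sample input are made up for illustration, and the sample is not the post's hasnulls.txt.

```shell
# Count NUL bytes: delete every byte except NUL, then count what is left.
# (The final tr strips the padding that some wc implementations add.)
count_nulls() {
  tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

# Hypothetical stand-in input: three NULs mixed in with three letters.
sample=$(mktemp)
printf 'a\000b\000\000c' > "$sample"
count_nulls "$sample"    # prints 3
rm -f "$sample"
```

Running count_nulls hasnulls.txt on the real input should agree with the number of lines Rosie printed.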

Using json output to see character positions

Rosie will report the character positions (1-based) of the matches if we use the json output format. And since we are going to read it ourselves instead of piping into another program, we can use the pretty-printed json format, jsonpp.

Command is: rosie grep -o jsonpp '"\x00"' hasnulls.txt

Recall that Rosie supports two pattern matching commands: match, which starts at the beginning of the input and returns the first match, and grep, which behaves more like Unix grep. Internally, Rosie grep is implemented using Rosie’s find macro, and each found instance is labeled in the output with the type find.*.

In the screen shot above, we can see that nulls were found at (start) positions 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, and 144. (This familiar sequence was ensured by the program that generated the sample input.)
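For readers who want to reproduce this without the original file, here is a rough sketch. It is not the program that generated the sample input: make_sample is a hypothetical stand-in that writes 144 bytes with nulls at the same positions and a plain 'x' everywhere else (the real file holds mostly random bytes). The second helper, null_positions, lists the 1-based null positions using od and awk, as a cross-check that is independent of Rosie.

```shell
# Hypothetical stand-in for hasnulls.txt: 144 bytes, NUL at positions
# 1, 2, 3, 5, 8, ... and 'x' everywhere else.
make_sample() {
  a=1; b=2; i=1
  while [ "$i" -le 144 ]; do
    if [ "$i" -eq "$a" ]; then
      printf '\000'
      t=$((a + b)); a=$b; b=$t
    else
      printf 'x'
    fi
    i=$((i + 1))
  done
}

# Print the 1-based position of every NUL byte in the file named by $1.
# od emits each byte as an unsigned decimal number; awk counts them off.
null_positions() {
  od -An -v -t u1 "$1" |
    awk '{ for (i = 1; i <= NF; i++) { n++; if ($i == 0) print n } }'
}

sample=$(mktemp)
make_sample > "$sample"
null_positions "$sample" | tr '\n' ' '   # prints 1 2 3 5 8 13 21 34 55 89 144
echo
rm -f "$sample"
```

The helpers use plain (non-local) shell variables to stay within POSIX sh, so they are best kept in a throwaway script.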

Rosie and jq

Rosie’s json output can easily be the input to another program. A couple of useful programs for manipulating JSON on the command line are jq and jsonata. In the screen shot below,

  • the .subs part of the argument to jq selects the Rosie sub-matches,
  • the [] processes each element of the array, and
  • the .s selects the start position (1-based) for each sub-match.

Command is: rosie grep -o json '"\x00"' code/nulls/hasnulls.txt | jq '.subs[].s'

Finally, to identify sequences of nulls, we could of course search for "\x00"+. The same jq command as before will show us the start of each run of nulls. There is a single run of 3 nulls in the input file, starting at the first character.

Command is: rosie grep -o json '"\x00"+' hasnulls.txt | jq '.subs[].s'

Note that jq is plenty capable of calculating the length of each run as well. How to do that is left as an exercise. :-)
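Without spoiling the jq exercise, the runs can be cross-checked with standard tools instead. The sketch below is an assumption-laden illustration, not part of Rosie: null_runs is a made-up helper, and the sample input is hypothetical. It prints the start position and length of every run of nulls, including runs of length 1.

```shell
# Print "start length" for each run of consecutive NUL bytes in the file named by $1.
null_runs() {
  od -An -v -t u1 "$1" |
    awk '{
           for (i = 1; i <= NF; i++) {
             n++
             if ($i == 0) {
               if (len == 0) start = n   # a new run begins here
               len++
             } else if (len > 0) {
               print start, len          # a run just ended
               len = 0
             }
           }
         }
         END { if (len > 0) print start, len }'   # run that reaches end of file
}

# Hypothetical input: NUL NUL NUL a b NUL c NUL NUL
sample=$(mktemp)
printf '\000\000\000ab\000c\000\000' > "$sample"
null_runs "$sample"
rm -f "$sample"
```

For this input the output is three lines: 1 3, then 6 1, then 8 2.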

Questions and feedback are welcome

Please post on the Rosie subreddit.


We welcome feedback and contributions. Please open issues (or merge requests) on GitLab, or get in touch by email.

Edit August 15, 2023: You can find my contact information, including Mastodon and LinkedIn coordinates, on my personal blog.