Handling null characters
I recently read this blog post about grepping for null bytes in text files.
Some versions of Unix grep (and many other tools) have inherited the
unfortunate trait of treating null bytes specially, in large part because C
language libraries use null to mark the end of a string.
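To see that C convention in action without writing any C, here is a small Python sketch using the standard ctypes module. It is only an illustration of null-terminated strings in general; it says nothing about how any particular grep is implemented, and the names raw, c_view, and full_view are made up for this example.

```python
import ctypes

raw = b"ab\x00cd"  # five bytes with a null in the middle

# create_string_buffer copies the bytes into a C char array
# (and appends one trailing null).
buf = ctypes.create_string_buffer(raw)

c_view = buf.value               # read as a C string: stops at the first null
full_view = buf.raw[:len(raw)]   # read as raw memory: all five bytes

print(c_view)     # b'ab'
print(full_view)  # b'ab\x00cd'
```

Any tool that reads its input through the C-string view would silently lose everything after the first null.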
While Rosie was developed on Unix/Linux, it was designed to handle any input, including null characters, invalid UTF-8, and indeed arbitrary byte sequences.
Rosie grep “\x00”
Our input file, hasnulls.txt, is a text file containing 144
(mostly) random bytes. Let’s use Rosie to search for the null bytes, using the
RPL hex character notation \x00. On the bash command line, we use single
quotes around the RPL expression we want to match, to prevent the shell from
expanding or interpolating it.
We’ll use the -o subs output format, so that Rosie will print each match on
its own line, making it easy to count the matches. (Technically, each null that
Rosie finds is a sub-match of the search pattern, which is why this output
format is named subs.)
We see in the above screen shot that there are 11 null bytes in the input file.
Just to be sure, let’s translate the nulls in the Rosie output into a printable
character like +.
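As an independent cross-check that does not involve Rosie at all, the same count and translation can be sketched in a few lines of Python. The sample bytes below are a hypothetical stand-in, not the contents of hasnulls.txt, and the function names are made up for this example.

```python
# Hypothetical stand-in data: a few bytes with embedded nulls.
sample = b"\x00\x00\x00a\x00bc\x00"

def count_nulls(data: bytes) -> int:
    """Count the null bytes in data."""
    return data.count(b"\x00")

def show_nulls(data: bytes, marker: bytes = b"+") -> bytes:
    """Replace each null byte with a printable marker."""
    return data.replace(b"\x00", marker)

print(count_nulls(sample))  # 5
print(show_nulls(sample))   # b'+++a+bc+'
```

To run it against a real file, read the file in binary mode (open(path, "rb")) so that no decoding step touches the bytes.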

Using json output to see character positions
Rosie will report the character positions (1-based) of the matches if we use the
json output format. And since we are going to read it ourselves instead of
piping into another program, we can use the pretty-printed json format,
jsonpp.
Recall that Rosie supports two pattern matching commands: match, which starts at
the beginning of the input and returns the first match, and grep, which
behaves more like Unix grep. Internally, Rosie grep is implemented using
Rosie’s find macro, and each found instance is labeled in the output with the
type find.*.
In the screen shot above, we can see that nulls were found at (start) positions 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, and 144. (This familiar sequence was ensured by the program that generated the sample input.)
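The program that generated the sample input is not shown in this post, so the following Python is only a guess at how a comparable file could be built and checked. It assumes plain filler bytes instead of random ones, and the helper names fib_positions, make_sample, and null_positions are hypothetical.

```python
def fib_positions(limit: int) -> list[int]:
    """Return the 1-based positions 1, 2, 3, 5, 8, ... that are <= limit."""
    positions, a, b = [], 1, 2
    while a <= limit:
        positions.append(a)
        a, b = b, a + b
    return positions

def make_sample(size: int = 144) -> bytes:
    """Build size filler bytes, then put a null at each Fibonacci position."""
    data = bytearray(b"x" * size)
    for pos in fib_positions(size):
        data[pos - 1] = 0  # convert 1-based position to 0-based index
    return bytes(data)

def null_positions(data: bytes) -> list[int]:
    """Return the 1-based position of every null byte in data."""
    return [i + 1 for i, byte in enumerate(data) if byte == 0]

print(null_positions(make_sample()))
# [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]
```

The printed list has 11 entries, which agrees with the count of null bytes reported earlier.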
Rosie and jq
Rosie’s json output can easily be the input to another program. A couple of
useful programs for manipulating JSON on the command line are
jq and jsonata. In the
screen shot below,
- the .subs part of the argument to jq selects the Rosie sub-matches,
- the [] processes each element of the array, and
- the .s selects the start position (1-based) for each sub-match.
![Command is: rosie grep -o json '"\x00"' code/nulls/hasnulls.txt | jq '.subs[].s'](/images/Nulls-jq.png)
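If jq is not at hand, the same selection can be sketched in Python with the standard json module. The JSON line below is an assumed, heavily trimmed example: only the field names subs and s and the type find.* come from this post, and everything else about the shape is a guess. The function name sub_starts is made up for illustration.

```python
import json

# Assumed shape of one line of Rosie's json output, trimmed to two sub-matches.
rosie_line = (
    '{"type": "*", "s": 1, "subs": ['
    '{"type": "find.*", "s": 1}, '
    '{"type": "find.*", "s": 2}]}'
)

def sub_starts(line: str) -> list[int]:
    """Mimic jq '.subs[].s': the start position of each sub-match."""
    match = json.loads(line)
    return [sub["s"] for sub in match.get("subs", [])]

print(sub_starts(rosie_line))  # [1, 2]
```

In practice the lines would come from the standard output of the rosie command rather than from a string literal.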
Finally, to identify sequences of nulls, we could of course search for
"\x00"+. The same jq command as before will show us the start of each run
of nulls. There is a single run of 3 nulls in the input file, starting at the
first character.
![Command is: rosie grep -o json '"\x00"+' hasnulls.txt | jq '.subs[].s'](/images/Nulls-in-runs.png)
Note that jq is plenty capable of calculating the length of each run as well.
How to do that is left as an exercise. :-)
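Without spoiling the jq exercise, here is a rough Python cross-check that works on the raw bytes instead of on Rosie's output. It uses the standard re module with a bytes pattern; the sample data and the function name null_runs are hypothetical.

```python
import re

def null_runs(data: bytes) -> list[tuple[int, int]]:
    """Return (1-based start, length) for each maximal run of null bytes."""
    return [
        (m.start() + 1, m.end() - m.start())
        for m in re.finditer(rb"\x00+", data)
    ]

# Hypothetical stand-in data: a run of three nulls, then two single nulls.
sample = b"\x00\x00\x00a\x00bc\x00"

print(null_runs(sample))  # [(1, 3), (5, 1), (8, 1)]
```

Reading the input as bytes matters here: a bytes pattern never attempts to decode the data, so invalid UTF-8 elsewhere in the file does not get in the way.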
Questions and feedback are welcome
Please post on the Rosie subreddit.
You can also:
- Open an issue on GitLab
- Send email to info@rosie-lang.org
Edit August 15, 2023: You can find my contact information, including Mastodon and LinkedIn coordinates, on my personal blog.