Handling null characters
I recently read this blog post about grepping for null bytes in text files.
Some versions of Unix grep (and many other tools) have inherited the
unfortunate trait of treating null bytes specially, in large part because C
language libraries use null to mark the end of a string.
While Rosie was developed on Unix/Linux, it was designed to handle any input, including null characters, invalid UTF-8, and indeed arbitrary byte sequences.
Rosie grep “\x00”
Our input file, hasnulls.txt, is a text file containing 144 (mostly) random
bytes. Let’s use Rosie to search for the null bytes, using the RPL hex
character notation \x00. On the bash command line, we use single quotes around
the RPL expression we want to match, to prevent the shell from expanding or
interpolating it.
We’ll use the -o subs output format, so that Rosie will print each match on
its own line, making it easy to count the matches. (Technically, each null that
Rosie finds is a sub-match of the search pattern, which is why this output
format is named subs.)
We see in the above screen shot that there are 11 null bytes in the input file.
Just to be sure, let’s translate the nulls in the Rosie output into a printable
character like +.
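As an independent cross-check outside of Rosie, the same count-and-translate step can be sketched in plain Python. This is only an illustration: the sample data is a hypothetical stand-in for hasnulls.txt, built with nulls at the positions reported later in this post and an arbitrary non-null filler byte everywhere else.

```python
import re

# Hypothetical stand-in for hasnulls.txt: 144 bytes, nulls at known positions.
NULL_POSITIONS = (1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144)  # 1-based

def make_sample(size=144):
    data = bytearray(b"x" * size)  # "x" is an arbitrary non-null filler
    for pos in NULL_POSITIONS:
        data[pos - 1] = 0          # convert 1-based position to 0-based index
    return bytes(data)

data = make_sample()

# One match per null byte, similar in spirit to one sub-match per line.
matches = re.findall(rb"\x00", data)

# Translate each matched null into a printable "+" so it can be seen.
printable = b"".join(m.replace(b"\x00", b"+") for m in matches)

print(len(matches))               # 11
print(printable.decode("ascii"))  # +++++++++++
```

Counting the translated characters and counting the matches should agree, which is the point of the sanity check.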
Using json output to see character positions
Rosie will report the character positions (1-based) of the matches if we use the
json output format. And since we are going to read it ourselves instead of
piping into another program, we can use the pretty-printed json format, jsonpp.
Recall that Rosie supports two pattern matching commands: match, which returns
the first match and starts at the beginning of the input, and grep, which
behaves more like Unix grep. Internally, Rosie grep is implemented using
Rosie’s find macro, and each found instance is labeled in the output with the
type find.*.
In the screen shot above, we can see that nulls were found at (start) positions 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, and 144. (This familiar sequence was ensured by the program that generated the sample input.)
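For readers who want to double-check positions like these without Rosie, here is a minimal Python sketch that reports the 1-based position of every null byte. The sample bytes are a hypothetical reconstruction, not the actual hasnulls.txt, and the helper name `null_positions` is made up for this example.

```python
# Hypothetical stand-in for hasnulls.txt, rebuilt from the positions above.
EXPECTED = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]

sample = bytearray(b"x" * 144)  # arbitrary non-null filler
for pos in EXPECTED:
    sample[pos - 1] = 0         # 1-based position -> 0-based index
sample = bytes(sample)

def null_positions(data):
    """Return the 1-based position of every null byte in data."""
    return [index + 1 for index, byte in enumerate(data) if byte == 0]

positions = null_positions(sample)
print(positions)
```

Python indexes from zero, so the `index + 1` is what lines the result up with the 1-based positions that Rosie reports.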
Rosie and jq
Rosie’s json output can easily be the input to another program. A couple of
useful programs for manipulating JSON on the command line are
jq and jsonata. In the screen shot below,
- the .subs part of the argument to jq selects the Rosie sub-matches,
- the [] processes each element of the array, and
- the .s selects the start position (1-based) for each sub-match.
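The three steps of that jq filter can be mirrored roughly in Python. This sketch assumes a heavily trimmed, hand-built JSON document that contains only the fields mentioned in this post (subs, s, and the find.* type); real Rosie output carries more fields, and the helper name `start_positions` is hypothetical.

```python
import json

# Hand-built, trimmed stand-in for Rosie's json output (an assumption:
# only the fields named in this post are included).
positions = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]
rosie_json = json.dumps(
    {"subs": [{"type": "find.*", "s": pos} for pos in positions]}
)

def start_positions(json_text):
    """Roughly what the .subs, [] and .s steps do, one step at a time."""
    document = json.loads(json_text)
    subs = document["subs"]          # .subs selects the sub-matches
    return [sub["s"] for sub in subs]  # [] visits each one, .s picks the start

starts = start_positions(rosie_json)
for start in starts:
    print(start)                     # one start position per line
```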
Finally, to identify sequences of nulls, we could of course search for
"\x00"+. The same jq command as before will show us the start of each run
of nulls. There is a single run of 3 nulls in the input file, starting at the
first character.
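The idea of matching runs rather than single nulls carries over to other regex engines. As a sketch only, here is the analogous search with Python's re module on a hypothetical reconstruction of the input (nulls at the positions reported earlier, an arbitrary filler byte elsewhere). It works on the raw bytes, not on Rosie's output.

```python
import re

# Hypothetical stand-in for hasnulls.txt.
NULL_POSITIONS = (1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144)  # 1-based
sample = bytearray(b"x" * 144)
for pos in NULL_POSITIONS:
    sample[pos - 1] = 0
sample = bytes(sample)

def null_runs(data):
    """Return (1-based start, length) for each maximal run of null bytes."""
    return [
        (match.start() + 1, match.end() - match.start())
        for match in re.finditer(rb"\x00+", data)
    ]

runs = null_runs(sample)
long_runs = [run for run in runs if run[1] > 1]
print(runs[0])    # (1, 3)
print(long_runs)  # [(1, 3)]
```

On this sample, the only run longer than one byte is the run of 3 starting at the first character, which agrees with the description above.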
Note that jq is plenty capable of calculating the length of each run as well.
How to do that is left as an exercise. :-)
Questions and feedback are welcome
Please post on the Rosie subreddit.
You can also:
- Open an issue on GitLab
- Send email to info@rosie-lang.org
We welcome feedback and contributions. Please open issues (or merge requests) on GitLab, or get in touch by email.
Edit August 15, 2023: You can find my contact information, including Mastodon and LinkedIn coordinates, on my personal blog.