I recently read this blog post about grepping for null bytes in text files.
Some versions of Unix grep (and many other tools) have inherited the unfortunate trait of treating null bytes specially, in large part because C language libraries use null to mark the end of a string.
While Rosie was developed on Unix/Linux, it was designed to handle any input, including null characters, invalid UTF-8, and indeed arbitrary byte sequences.
Rosie grep "\x00"
Our input file, hasnulls.txt, is a text file containing 144 (mostly) random bytes. Let’s use Rosie to search for the null bytes, using the RPL hex character notation \x00. On the bash command line, we use single quotes around the RPL expression we want to match, to prevent the shell from expanding or interpolating it.
We’ll use the -o subs output format, so that Rosie will print each match on its own line, making it easy to count the matches. (Technically, each null that Rosie finds is a sub-match of the search pattern, which is why this output format is named subs.)
We see in the above screen shot that there are 11 null bytes in the input file.
Just to be sure, let’s translate the nulls in the Rosie output into a printable character.
Using json output to see character positions
Rosie will report the character positions (1-based) of the matches if we use the json output format. And since we are going to read it ourselves instead of piping it into another program, we can use the pretty-printed json format.
Recall that Rosie supports two pattern matching commands: match, which returns the first match and starts at the beginning of the input, and grep, which behaves more like Unix grep. Internally, Rosie grep is implemented using the
find macro, and each found instance is labeled in the output with the
In the screen shot above, we can see that nulls were found at (start) positions 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, and 144. (This familiar sequence was ensured by the program that generated the sample input.)
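The generator program itself is not shown in this post. As a rough sketch in Python (not the actual generator), assuming only what is stated above, namely 144 bytes with nulls at the listed 1-based positions, a comparable input can be built and checked like this. The function names, the seed, and the random filler bytes are illustrative choices.

```python
import random

# 1-based positions of the null bytes, as listed in the text above.
NULL_POSITIONS = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]

def make_sample(size=144, positions=NULL_POSITIONS, seed=0):
    """Return `size` bytes: nulls at the given 1-based positions,
    random non-null bytes everywhere else."""
    rng = random.Random(seed)
    data = bytearray(rng.randint(1, 255) for _ in range(size))
    for pos in positions:
        data[pos - 1] = 0  # convert 1-based position to 0-based index
    return bytes(data)

def null_positions(data):
    """Return the 1-based position of every null byte in `data`."""
    return [i + 1 for i, b in enumerate(data) if b == 0]

sample = make_sample()
print(len(sample), null_positions(sample))
```

Writing `sample` to a file opened in binary mode would give an input similar in shape to hasnulls.txt, though the non-null bytes will differ from the original.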
Rosie and jq
The .subs part of the argument to jq selects the Rosie sub-matches,
processes each element of the array, and
.s selects the start position (1-based) for each sub-match.
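If jq is not at hand, the same selection can be sketched in Python. This assumes only what is described above: the json output contains a subs array whose elements each carry a start position s. The sample string below is a hypothetical, heavily abbreviated stand-in for real Rosie output, and the function name is made up for illustration.

```python
import json

# Hypothetical, abbreviated stand-in for Rosie's json output. Only the
# "subs" array and the "s" (start) field named in the text are modeled.
sample_output = '{"s": 1, "subs": [{"s": 1}, {"s": 2}, {"s": 3}, {"s": 5}]}'

def sub_match_starts(json_text):
    """Return the start position (1-based) of each sub-match."""
    match = json.loads(json_text)
    # Treat a match with no "subs" key as having no sub-matches.
    return [sub["s"] for sub in match.get("subs", [])]

print(sub_match_starts(sample_output))
```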
Finally, to identify sequences of nulls, we could of course search for "\x00"+. The same jq command as before will show us the start of each run of nulls. There is a single run of 3 nulls in the input file, starting at the first byte (position 1).
jq is plenty capable of calculating the length of each run as well.
How to do that is left as an exercise. :-)
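For anyone who wants something to check their jq answer against, here is a small Python sketch that computes the start and length of each run directly from the raw bytes. It does not use Rosie or jq at all; it relies on Python's re module with a bytes pattern, and the function name is just illustrative.

```python
import re

def null_runs(data):
    """Return (start, length) for each run of consecutive null bytes.
    The start position is 1-based, to match Rosie's convention."""
    return [(m.start() + 1, m.end() - m.start())
            for m in re.finditer(rb"\x00+", data)]

# Small demo input: a run of 3 nulls, then two single nulls.
demo = b"\x00\x00\x00a\x00bc\x00"
print(null_runs(demo))  # [(1, 3), (5, 1), (8, 1)]
```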
Questions and feedback are welcome
Please post on the Rosie subreddit.
You can also: