
[perl #105920] Perl parser sometimes tolerates stray nulls, sometimes not

December 13, 2017 08:08
Father Chrysostomos wrote:
>Stray nulls are tolerated in files but not in string evals.  Why is that?
>I know which piece of code is doing it, but is it by design?  Why the

Prior to Perl 3.0, the tokeniser made no attempt to handle NULs in source.
It perceived a null byte strictly as indicating the end of the buffer.
The treatment of end of buffer differs depending on whether we're parsing
a string or a file: if it's a string then end of buffer is end of source,
but if it's a file then we try to read more into the buffer.  When end
of file is reached, in some cases the tokeniser has some epilogue code
that it puts in the buffer and continues tokenising in string mode.

Perl 3.0 in 1989 was the first version that claimed to support binary
data.  The tokeniser was changed to generally cope with the possibility
of NULs in source and so in the buffer.  They're accepted as literal
characters in string syntax, for example.  In general tokenising context,
then as now, a source NUL and end of buffer are both initially routed
to the same branch of the main switch.  One would think that at that
point the first priority would be to distinguish source NUL from end of
buffer, but actually the new check for that wasn't put first.  Instead,
the old check for string vs file was left as the first thing, with the
check for buffer end coming next, only executed if parsing from a file.
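
As a rough illustration of that ordering, here is a toy C model of the
decision made on seeing a zero byte.  It is only a sketch, not the code
in toke.c: the enum, the function name classify_nul and its parameters
are all made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome codes for this sketch; not names from toke.c. */
enum nul_action { END_OF_SOURCE, REFILL_BUFFER, SKIP_STRAY_NUL };

/*
 * Decide what a zero byte at offset `pos` means, given a buffer holding
 * `buf_len` bytes of source.  pos == buf_len stands for the terminator
 * that follows the buffer contents.
 */
enum nul_action
classify_nul(int parsing_string, size_t pos, size_t buf_len)
{
    assert(pos <= buf_len);

    /* Old check, still first: string or file? */
    if (parsing_string)
        return END_OF_SOURCE;   /* taken even when pos < buf_len */

    /* Newer check, only reached when parsing from a file. */
    if (pos >= buf_len)
        return REFILL_BUFFER;   /* genuine end of buffer */

    return SKIP_STRAY_NUL;      /* a NUL inside the file's text */
}
```

In this model classify_nul(1, 3, 5) yields END_OF_SOURCE although two
more bytes remain in the buffer, whereas classify_nul(0, 3, 5) yields
SKIP_STRAY_NUL.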

The treatment of a NUL, when detected, was as it is now, to skip past it.
The line in toke.c with the comment /* ignore stray nulls */ dates back
to 3.0; clearly to ignore NULs in files was intentional behaviour.

Indeed, it's a time-honoured interpretation of a NUL: all bits zero means
this is a blank part of the paper tape that needs to be skipped past.
One could edit a tape by punching new content into sections left blank for
this purpose.  (This is the origin of DEL: all bits one means there used
to be some content here but it's been erased by punching all positions,
so this too needs to be skipped over.)

Of course, this time-honoured treatment of NUL is more at home in the
1960s than the 1980s, let alone today.  Few of us use paper tape any more.
(The characteristics of Flash memory have revived interest in data
structures designed for overpunch-type editing, but in that context it's
not often applied to ASCII.)  And if NULs are skipped on that basis then
it makes no conceptual sense to treat them differently depending on the
grammatical context, as is done by accepting NULs as literals in strings.
Even having NULs interrupt barewords, which Perl 3.0 and blead both
do, is incompatible with that interpretation.  The syntactic treatment
of NULs in files is actually as if they're whitespace characters, an
interpretation that has much less historical justification.

Anyway, was it intentional that NULs in string evals are errors?  The
answer is in fact that the question is wrong.  NULs in string evals
*aren't* errors, and never have been.  Did you notice that the error
message you get is a "syntax error", not the "unrecognized character"
that you get by including an arbitrary invalid character?

    $ perl -e 'eval "3+4\0"; print $@'
    syntax error at (eval 1) line 1, at EOF
    $ perl -e 'eval "3+4\xa1"; print $@'
    Unrecognized character \xA1; marked by <-- HERE after 3+4<-- HERE near column 4 at (eval 1) line 1.

The error is coming from the yacc parser, not from the tokeniser.
The tokeniser is interpreting the null byte in the pre-3.0 way: as the
end of the string, and so the end of the source.  It returns a YYEOF
token to perly.  If the string content up to that point was good, then
parsing will succeed:

    $ perl -e 'eval "print qq(hi\\n);\0garbage here"; print $@ || "OK\n"'

Why do you usually get a syntax error?  Because of semicolons.
The grammar specifies that normal statements must end with a semicolon.
The semicolon is implicit at the end of the source, and in the case of
an eval this is and always was implemented by appending a semicolon
character to the string before feeding it to the tokeniser.  If you
terminate the string early with a NUL, you don't get the benefit of the
implicit semicolon.
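
The effect can be modelled in a few lines of C.  This is a simplified
sketch under my own assumptions, not what perl actually executes: the
function name implicit_semicolon_reached is hypothetical, and the model
only asks whether a scanner that stops at the first zero byte would get
as far as the appended semicolon.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Returns 1 if a tokeniser that treats the first zero byte as end of
 * source would reach the ';' appended to an eval string, 0 if it would
 * stop before it, and -1 if memory could not be allocated.
 * `src` holds `len` bytes and may contain NULs.
 */
int implicit_semicolon_reached(const char *src, size_t len)
{
    char *buf = malloc(len + 1);
    const char *nul;
    int reached;

    if (buf == NULL)
        return -1;

    memcpy(buf, src, len);  /* the eval text */
    buf[len] = ';';         /* the implicit statement terminator */

    /* Any zero byte found here necessarily precedes the ';'. */
    nul = memchr(buf, '\0', len + 1);
    reached = (nul == NULL);

    free(buf);
    return reached;
}
```

For the three bytes of "3+4" this returns 1.  For the four bytes of
"3+4\0" it returns 0: the semicolon is in the buffer, but the scanner
never gets to it.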

It is clearly a bug that NUL in an eval string terminates tokenisation
early.  Perl 3.0 didn't quite live up to its hype about binary
cleanliness, and neither has any subsequent Perl.  The intent for Perl
3.0 was obviously that NULs would consistently behave as whitespace.
It would have achieved that goal had two conditions just been tested in
the opposite order.  We could fulfil that intent in today's Perl by just
moving two lines of code, fixing that old bug.
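
To make the reordering concrete, here is a toy C model of the decision
with the end-of-buffer test placed first.  It is a sketch only: the enum
and the name classify_nul_reordered are hypothetical and do not appear
in toke.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome codes for this sketch; not names from toke.c. */
enum nul_outcome { SOURCE_ENDS, READ_MORE, TREAT_AS_SPACE };

/*
 * Decide what a zero byte at offset `pos` means, given a buffer holding
 * `buf_len` bytes of source.  pos == buf_len stands for the terminator
 * that follows the buffer contents.
 */
enum nul_outcome
classify_nul_reordered(int parsing_string, size_t pos, size_t buf_len)
{
    assert(pos <= buf_len);

    /* First: is this a NUL in the source, or the end of the buffer? */
    if (pos < buf_len)
        return TREAT_AS_SPACE;  /* same answer for strings and files */

    /* Only at a genuine buffer end does string vs file matter. */
    if (parsing_string)
        return SOURCE_ENDS;
    return READ_MORE;
}
```

With this order, a NUL in the middle of an eval string gets the same
treatment as a NUL in the middle of a file.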

However, as a matter of language design, it seems silly to be treating
NULs like this.  NUL isn't a whitespace character.  A more appealing
way to resolve the inconsistency is to deprecate both of the current
NUL behaviours, eventually making NUL illegal in general tokenisation
context, just like most other control characters are.

-zefram