
Is this approach to checking for a program's UTF-8 wellformedness ok?

Karl Williamson
February 9, 2017 22:10
I have been working on getting the lexer to find malformed UTF-8 input. 
The advice I originally got as to where to put the checks was in 
lex_next_chunk().  It turns out that there are two problems with this:

1) It doesn't catch string evals.
2) UTF8ness can change in mid chunk, after we've examined the chunk.

The first can readily be handled by doing the same check in the eval 
portion of lex_start.

The second can arise from something like

	use utf8; my $é = 0;

where the chunk has already been examined by the time 'use utf8' takes 
effect.

There's also the possibility of the line containing 'no utf8' after 
having been in utf8, and the text after it being invalid UTF-8.  This 
will generate a malformed error, when one really isn't called for.  I 
propose not to worry about this.  A file's encoding in practice really 
can't change from utf-8 to not.

What I've come up with to handle the case where we find out we're in 
UTF-8 in mid chunk is this:

1) Changing to UTF-8 already triggers magic.  Put a hook there to call a 
new function in toke.c that sets a (new) flag in the parser object that 
indicates the remaining portion of the chunk needs to be rechecked for 
UTF-8 validity.

2) Test that flag at the entry to yylex, branch predicted to be off, and 
if set, recheck the input buffer, and clear the flag.
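
To make the two steps concrete, here is a minimal standalone sketch in C. 
It is not the actual toke.c code: toy_parser, toy_utf8_turned_on(), 
toy_lex_entry() and toy_is_valid_utf8() are all hypothetical names 
invented for illustration, and the validity check is a plain strict 
UTF-8 check, which is an assumption and may well differ from what the 
core would accept.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parser object; NOT perl's real struct. */
typedef struct {
    const unsigned char *buf; /* current chunk */
    size_t len;               /* chunk length in bytes */
    size_t pos;               /* next unread byte */
    bool in_utf8;             /* source is being treated as UTF-8 */
    bool recheck_utf8;        /* new flag: rest of chunk needs rechecking */
    bool malformed;           /* set on failure (real code would raise an error) */
} toy_parser;

/* Strict check (an assumption): rejects overlongs, surrogates, > U+10FFFF. */
bool toy_is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t need;
        unsigned long cp, min;

        if (c < 0x80) { i++; continue; }            /* ASCII */
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return false;       /* stray continuation or invalid start byte */

        if (len - i - 1 < need) return false;       /* truncated sequence */
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += need + 1;
    }
    return true;
}

/* Step 1: what the hook on the UTF-8 switch would call. It only sets the flag. */
void toy_utf8_turned_on(toy_parser *p)
{
    assert(p != NULL);
    p->in_utf8 = true;
    p->recheck_utf8 = true;
}

/* Step 2: at lexer entry, test the flag (normally off); if it is set,
 * clear it and recheck the unread remainder of the chunk. */
bool toy_lex_entry(toy_parser *p)
{
    assert(p != NULL);
    if (p->recheck_utf8) {
        p->recheck_utf8 = false;
        if (!toy_is_valid_utf8(p->buf + p->pos, p->len - p->pos))
            p->malformed = true;
    }
    return !p->malformed;
}
```

In this toy model the hook does no validation of its own. The failure is 
only noticed on the next entry to the lexer, so any error would be 
reported from the lexer's context rather than from the hook's.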

I've experimented with this, and it fixes all the relevant fuzzing cases 
that have been presented.  But since this is an area of the core I'm not 
very familiar with, I thought I should ask those who are if I'm missing 
something obvious.

Trying to have the hook called from mg.c do the recheck itself would 
avoid the check in yylex, but it causes problems in that the error 
message context is from, and not the call to  That 
could be worked around, but at extra complexity and brittleness. 
Checking the flag in yylex automatically gets the context correct.  It 
also avoids having to special case where the UTF8ness flag gets changed 
at runtime, too late to be effective.
