I have been working on getting the lexer to detect malformed UTF-8 input. The advice I originally got was to put the checks in lex_next_chunk(). It turns out that there are two problems with this:

1) It didn't catch string evals.

2) UTF8ness can change in mid chunk, after we've examined the chunk.

The first can readily be handled by doing the same check in the eval portion of lex_start(). The second can come from something like

    use utf8; my $é = 0;

There's also the possibility of a line containing 'no utf8' after having been in utf8, with the text after it being invalid UTF-8. This would generate a malformed-character error when one really isn't called for. I propose not to worry about this: in practice a file's encoding really can't change from UTF-8 to something else.

What I've come up with to handle the case where we find out we're in UTF-8 in mid chunk is this:

1) Changing to UTF-8 already triggers magic. Put a hook there that calls a new function in toke.c, which sets a (new) flag in the parser object indicating that the remaining portion of the chunk needs to be rechecked for UTF-8 validity.

2) Test that flag at the entry to yylex(), branch predicted to be off, and if it is set, recheck the input buffer and clear the flag. (A standalone sketch of this scheme follows at the end of this message.)

I've experimented with this, and it fixes all the relevant fuzzing cases that have been presented. But since this is an area of the core I'm not very familiar with, I thought I should ask those who are whether I'm missing something obvious.

Having the hook called from mg.c do the recheck itself would avoid the check in yylex(), but it causes problems in that the error message context comes from utf8.pm rather than from the call to utf8.pm. That could be worked around, but at extra complexity and brittleness. Checking the flag in yylex() automatically gets the context correct. It also avoids having to special-case the situation where the UTF8ness flag gets changed at runtime, too late to be effective.
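To make the proposed flow concrete, here is a minimal standalone sketch of the flag scheme in C. It is not the actual core patch: the parser_t struct, the names lexer_set_recheck_utf8() and is_valid_utf8(), and the deliberately simplified validator (it doesn't reject overlongs or surrogates) are all invented for illustration, and __builtin_expect() stands in for the core's branch-prediction hint. The real implementation would hang the flag off the parser object in toke.c and use the core's own UTF-8 validation machinery.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical, simplified model of the parser state; these are
     * not the actual perl core identifiers. */
    typedef struct {
        const unsigned char *bufptr;  /* current position in the chunk */
        const unsigned char *bufend;  /* end of the current chunk */
        bool recheck_utf8;            /* set when UTF8ness flips mid chunk */
    } parser_t;

    /* Called from the magic hook that fires when 'use utf8' takes
     * effect: it only records that the rest of the chunk needs to be
     * re-validated; no checking happens here. */
    static void lexer_set_recheck_utf8(parser_t *parser) {
        parser->recheck_utf8 = true;
    }

    /* Simplified structural validity check (start bytes plus the right
     * number of continuation bytes); the core would use its real
     * UTF-8 validation routines instead. */
    static bool is_valid_utf8(const unsigned char *s, const unsigned char *e) {
        while (s < e) {
            unsigned char c = *s++;
            int follow;
            if      (c < 0x80)           follow = 0;
            else if ((c & 0xE0) == 0xC0) follow = 1;
            else if ((c & 0xF0) == 0xE0) follow = 2;
            else if ((c & 0xF8) == 0xF0) follow = 3;
            else return false;           /* bad start byte */
            while (follow--) {
                if (s >= e || (*s++ & 0xC0) != 0x80)
                    return false;        /* truncated or bad continuation */
            }
        }
        return true;
    }

    static int yylex(parser_t *parser) {
        /* Branch predicted to be off: the flag is set only on the rare
         * transition into UTF-8 partway through a chunk. */
        if (__builtin_expect(parser->recheck_utf8, 0)) {
            parser->recheck_utf8 = false;  /* clear before rechecking */
            if (!is_valid_utf8(parser->bufptr, parser->bufend)) {
                /* The error is raised here, in the lexer, so the
                 * message context points at the caller's code, not
                 * at utf8.pm. */
                fprintf(stderr, "Malformed UTF-8 character\n");
                return -1;
            }
        }
        /* ... normal tokenizing would continue here ... */
        return 0;
    }

    int main(void) {
        /* "é" in UTF-8 is 0xC3 0xA9; 0xC3 alone is truncated. */
        const unsigned char good[] =
            { 'm','y',' ','$',0xC3,0xA9,' ','=',' ','0',';' };
        const unsigned char bad[]  = { 'm','y',' ','$',0xC3,';' };

        parser_t p = { good, good + sizeof good, false };
        lexer_set_recheck_utf8(&p);            /* as if 'use utf8' just fired */
        printf("good chunk: %d\n", yylex(&p)); /* 0: passes the recheck */

        parser_t q = { bad, bad + sizeof bad, false };
        lexer_set_recheck_utf8(&q);
        printf("bad chunk:  %d\n", yylex(&q)); /* -1: malformed detected */
        return 0;
    }

The property the sketch tries to capture is that the magic hook does nothing except set the flag; all validation, and therefore all error reporting, happens inside the lexer, which is what keeps the message context pointing at the caller rather than at utf8.pm.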