
Re: [perl #133101] Anomalies in handling malformed utf8 input

Karl Williamson
April 12, 2018 15:36
On 04/12/2018 08:10 AM, Ricardo SIGNES via RT wrote:
> On Wed, 11 Apr 2018 14:35:20 -0700, wrote:
>> Using the options -CSD (-CD makes the special ARGV handle used by -n open
>> the passed filename with :utf8, -CS interprets the STDIN with :utf8) and
>> -Mutf8 (for the source code passed to -e) should make these examples
>> function as expected.
>> -Dan
> I'm not sure this is sufficient explanation.  Consider:
>    ~$ cat bad | perl -CAS -Mutf8 -lne 'print if /ąę/'
>    ~$ cat bad | perl -CAS -Mutf8 -lne 'print if /[ąę]/'
>    Malformed UTF-8 character (fatal) at -e line 1, <> line 1.
> Our input comes from stdin, and we have used -CS, which means STDIN is assumed UTF-8.  In both cases, we use -Mutf8.  We only see a fatal error in the second case, when we have used a character class instead of a string.

I'm not sure the file survived the email transfer intact, because I 
saved it and get a bunch of REPLACEMENT CHARACTERS, and so can't 
reproduce it.

But I know the reason one fails and the other doesn't.  Perl does not 
currently examine its input for utf8 validity unless the proper layer is 
used, and the :utf8 layer that -C applies isn't it.  That is a source of 
frustration to both rjbs and me.
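
As a minimal sketch of what checking validity explicitly could look like, the snippet below decodes raw bytes with the Encode module in strict mode. The helper name decode_strict and the sample byte strings are made up for illustration; they are not from the bug report, and this is not how perl's I/O layers are implemented.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: returns the decoded string, or undef when the
# bytes are not well-formed UTF-8 (decode() dies under FB_CROAK).
sub decode_strict {
    my ($bytes) = @_;
    my $chars = eval { decode('UTF-8', $bytes, FB_CROAK) };
    return $chars;
}

my $good = "\xC4\x85\xC4\x99";   # UTF-8 bytes for U+0105 U+0119
my $bad  = "abc\xFFdef";         # 0xFF can never appear in UTF-8

for my $case ([good => $good], [bad => $bad]) {
    my ($name, $bytes) = @$case;
    my $verdict = defined decode_strict($bytes) ? "valid" : "malformed";
    print "$name: $verdict\n";
}
```

The point of the sketch is only that validation is a separate, explicit step; input read through a layer that skips this step can reach the regex engine still malformed.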

We also don't go out of our way to make validity checks as we execute. 
Those checks are only done if the result somehow depends on them.  If we 
can, for example, fail a match without needing to know the UTF-8 
validity of the target string, we do so, without slowing down everything 
while we check, perhaps for the umpteenth time, that the string is valid.

That is what is happening here, as you can see if you add -Dr.  As an 
aside, that is the first thing an experienced perl programmer should do 
when thinking there is a regex bug.
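
The -Dr switch needs a perl built with -DDEBUGGING.  On an ordinary build, the re pragma gives similar regex tracing on STDERR.  A small sketch, with a made-up target string rather than the file from the report:

```perl
use strict;
use warnings;
use utf8;           # the pattern below contains literal non-ASCII letters
use re 'debug';     # trace regex compilation and execution on STDERR

# Hypothetical target that does not contain the required substring.
my $target = "no such letters here";

print "matched\n" if $target =~ /ąę/;
```

The exact trace text varies between perl versions, so none is reproduced here.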

In the first case, you get this:
UTF-8 pattern and string...
Intuit: trying to determine minimum start position...
   doing 'check' fbm scan, [0..146] gave -1

   Did not find anchored substr "%x{105}%x{119}"...
Match rejected by optimizer

In this case, we can tell that the match will fail because we first do a 
Fast Boyer-Moore (fbm) scan for the 4-byte sequence that comprises the 
needed string.  It wasn't there, so no need to look in more detail.
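
The idea can be sketched with a plain byte search.  This uses index() rather than perl's internal fbm code, and the helper name might_match and the sample target are invented for illustration.  The search never decodes the target, so it cannot notice (or care) that the target is malformed:

```perl
use strict;
use warnings;

# UTF-8 bytes of the literal the pattern requires: U+0105 U+0119.
my $needle = "\xC4\x85\xC4\x99";

# Hypothetical stand-in for the optimizer's pre-check: a byte-level
# substring search that never inspects the target as UTF-8.
sub might_match {
    my ($target_bytes) = @_;
    return index($target_bytes, $needle) >= 0;
}

# Made-up target containing bytes that are invalid in UTF-8.
my $malformed = "6  01/02 \xFF\xFE junk";

print might_match($malformed)
    ? "needle present, run the full match\n"
    : "needle absent, rejected without decoding\n";
```

Because the required bytes are absent, the match can be rejected without ever establishing whether the rest of the string is well-formed.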

The second case is different.
I get (with my sanitized input)
UTF-8 pattern and string...
Matching stclass ANYOF[0105 0119] against "    6  01/02 %" 
   %x{fffd}%x{fffd}%x{fffd}"... (146 bytes)
Contradicts stclass... [regexec_flags]
Match failed

In this case we don't do a byte scan, but have to examine the string in 
detail, and during that discover that it is malformed.

The fix for this is to fix :utf8 to do validity checking by default. 
We're not going to cripple perl's performance by adding validity checks 
where the outcome doesn't depend on validity.  And we're not going to 
make the code more complex by deciding, here we may be able to ignore 
that it's invalid, and press on.  To prevent segfaults and stuff, we 
have to refuse to handle invalid utf-8 when it matters.
