develooper Front page | perl.perl5.porters | Postings from April 2018

Re: [perl #133101] Anomalies in handling malformed utf8 input

From:
Karl Williamson
Date:
April 12, 2018 15:38
Subject:
Re: [perl #133101] Anomalies in handling malformed utf8 input
Message ID:
2e918105-5380-4f1e-2ecf-fa7391a3d347@khwilliamson.com
On 04/12/2018 09:36 AM, Karl Williamson wrote:
> On 04/12/2018 08:10 AM, Ricardo SIGNES via RT wrote:
>> On Wed, 11 Apr 2018 14:35:20 -0700, grinnz@gmail.com wrote:
>>> Using the options -CSD (-CD makes the special ARGV handle used by -n
>>> open the passed filename with :utf8, -CS interprets the STDIN with
>>> :utf8) and -Mutf8 (for the source code passed to -e) should make
>>> these examples function as expected.
>>>
>>> -Dan
>>
>> I'm not sure this is sufficient explanation.  Consider:
>>
>>    ~$ cat bad | perl -CAS -Mutf8 -lne 'print if /ąę/'
>>    ~$ cat bad | perl -CAS -Mutf8 -lne 'print if /[ąę]/'
>>    Malformed UTF-8 character (fatal) at -e line 1, <> line 1.
>>
>> Our input comes from stdin, and we have used -CS, which means STDIN is 
>> assumed UTF-8.  In both cases, we use -Mutf8.  We only see a fatal 
>> error in the second case, when we have used a character class instead 
>> of a string.
>>
> 
> I'm not sure the file survived the email transfer intact, because I 
> saved it and get a bunch of REPLACEMENT CHARACTERS, and so can't 
> reproduce it.
> 
> But I know the reason one fails and the other doesn't.  Perl does not 
> currently examine its input for utf8 validity unless the proper layer is 
> used, which this isn't.  That is a source of frustration to both rjbs 
> and me.
> 
> We also don't go out of our way to make validity checks as we execute. 
> Those checks are only done if the result somehow depends on them.  If we 
> can, for example, fail a match without needing to know the UTF-8 
> validity of the target string, we do so, without slowing down everything 
> while we check, perhaps for the umpteenth time, that the string is valid.
> 
> That is what is happening here, as you can see if you add -Dr.  As an 
> aside, that is the first thing an experienced perl programmer should do 
> when thinking there is a regex bug.
> 
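
A note for anyone trying this at home: -Dr only produces output on a
perl built with -DDEBUGGING.  On a stock perl, "use re 'debug'" gives
similar regex tracing on STDERR.  A minimal sketch follows; the helper
name debug_match and the test string are made up for illustration, and
the character class is the one from the report written with escapes.

```perl
use strict;
use warnings;

# Hypothetical helper: run one match with regex debugging switched on.
# The trace (compile-time and run-time) is written to STDERR.
sub debug_match {
    my ($target) = @_;
    use re 'debug';    # lexically scoped: only affects this sub
    return $target =~ /[\x{105}\x{119}]/ ? 1 : 0;
}

# Illustrative call; the target string is an assumption, not the
# reporter's data.
debug_match("no Polish letters here");
```

The trace format differs between perl versions, so expect the wording
to vary from the excerpts quoted in this thread.
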
> In the first case, you get this:
> UTF-8 pattern and string...
> Intuit: trying to determine minimum start position...
>    doing 'check' fbm scan, [0..146] gave -1
> 
>    Did not find anchored substr "%x{105}%x{119}"...
> Match rejected by optimizer
> 
> In this case, we can tell that the match will fail because we first run 
> a fast Boyer-Moore scan for the 4-byte sequence that makes up the needed 
> string.  It wasn't there, so no need to look in more detail.
> 
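
To make the "4-byte sequence" concrete: U+0105 and U+0119 each take
two bytes in UTF-8, so the literal is four bytes long and can be
searched for as plain octets.  The sketch below uses index() as a
stand-in for the optimizer's scan (it is not the real fbm code), and
the helper names are hypothetical.  The point is only that a byte
search never has to decide whether the rest of the buffer is
well-formed.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: UTF-8 bytes of a character string, as hex.
sub utf8_hex {
    my ($chars) = @_;
    return join ' ', unpack '(H2)*', encode('UTF-8', $chars);
}

# Hypothetical helper: byte-level substring search.  No decoding and
# no validity check of $octets takes place here.
sub bytes_contain {
    my ($octets, $needle_chars) = @_;
    return index($octets, encode('UTF-8', $needle_chars)) >= 0 ? 1 : 0;
}

print utf8_hex("\x{105}\x{119}"), "\n";    # prints "c4 85 c4 99"

# A buffer with a stray 0xFF byte is still searched without complaint.
print bytes_contain("x\xC4\x85\xFFy", "\x{105}\x{119}") ? "found" : "not found", "\n";
```
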
> The second case is different.
> I get (with my sanitized input)
> UTF-8 pattern and string...
> Matching stclass ANYOF[0105 0119] against "    6  01/02 %"   
> %x{fffd}%x{fffd}%x{fffd}"... (146 bytes)
> Contradicts stclass... [regexec_flags]
> Match failed
> 
> In this case we don't do a byte scan, but have to examine the string in 
> detail, and during that discover that it is malformed.
> 
> The fix for this is to fix :utf8 to do validity checking by default. 
> We're not going to cripple perl's performance by adding validity checks 
> where the outcome doesn't depend on validity.  And we're not going to 
> make the code more complex by deciding, here we may be able to ignore 
> that it's invalid, and press on.  To prevent segfaults and stuff, we 
> have to refuse to handle invalid utf-8 when it matters.
> 

I believe it's documented somewhere that you can have inconsistent 
results with invalid UTF-8 input.
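
Until :utf8 validates by default, one way to get consistent behaviour
is to validate at the input boundary yourself, so that malformed data
never reaches the regex engine.  What follows is only a sketch under
stated assumptions: it reads raw octets and decodes each line strictly
with Encode, the helper name decode_strict is hypothetical, and the
sample data is invented.  Opening the handle with the stricter
:encoding(UTF-8) layer is the other common route, and is presumably
the "proper layer" referred to above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: strictly decode raw octets as UTF-8.
# Returns the character string, or undef if the octets are malformed.
sub decode_strict {
    my ($octets) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    return eval { Encode::decode('UTF-8', $octets, $check) };
}

# Invented sample: line 1 is valid UTF-8, line 2 has a stray 0xFF byte.
my $raw = "ok \xC4\x85\xC4\x99\nbad \xC4\x85\xFF\n";
open my $in, '<:raw', \$raw or die "open: $!";

while (my $line = <$in>) {
    chomp $line;
    my $chars = decode_strict($line);
    if (defined $chars) {
        print "line $.: valid, ", length($chars), " characters\n";
    }
    else {
        print "line $.: malformed UTF-8, skipped\n";
    }
}
close $in;
```

With input checked this way, a pattern and a character class see the
same guaranteed-valid string, so the optimizer's shortcuts can no
longer change whether an error is reported.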
