Re: [perl #133101] Anomalies in handling malformed utf8 input
From:
Karl Williamson
Date:
April 12, 2018 15:38
Subject:
Re: [perl #133101] Anomalies in handling malformed utf8 input
Message ID:
2e918105-5380-4f1e-2ecf-fa7391a3d347@khwilliamson.com
On 04/12/2018 09:36 AM, Karl Williamson wrote:
> On 04/12/2018 08:10 AM, Ricardo SIGNES via RT wrote:
>> On Wed, 11 Apr 2018 14:35:20 -0700, grinnz@gmail.com wrote:
>>> Using the options -CSD (-CD makes the special ARGV handle used by -n open
>>> the passed filename with :utf8, -CS interprets STDIN with :utf8) and
>>> -Mutf8 (for the source code passed to -e) should make these examples
>>> function as expected.
>>>
>>> -Dan
>>
>> I'm not sure this is sufficient explanation. Consider:
>>
>> ~$ cat bad | perl -CAS -Mutf8 -lne 'print if /ąę/'
>> ~$ cat bad | perl -CAS -Mutf8 -lne 'print if /[ąę]/'
>> Malformed UTF-8 character (fatal) at -e line 1, <> line 1.
>>
>> Our input comes from stdin, and we have used -CS, which means STDIN is
>> assumed UTF-8. In both cases, we use -Mutf8. We only see a fatal
>> error in the second case, when we have used a character class instead
>> of a string.
>>
>
> I'm not sure the file survived the email transfer intact, because I
> saved it and get a bunch of REPLACEMENT CHARACTERS, and so can't
> reproduce it.
>
> But I know the reason one fails and the other doesn't. Perl does not
> currently examine its input for utf8 validity unless the proper layer is
> used, which this isn't. That is a source of frustration to both rjbs
> and me.
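
Until the layer checks for itself, one workaround is to read raw bytes and decode them explicitly. The following is only a minimal sketch of that idea: `decode_strict` is a hypothetical helper name, and the sample byte strings are made up for illustration (they are not the reporter's file). It relies on `Encode::decode` with `FB_CROAK`, which dies on malformed input.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of perl or of this thread): decode raw bytes
# as strict UTF-8, returning undef when they are malformed.
sub decode_strict {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps the caller's $bytes untouched.
    my $text = eval {
        Encode::decode('UTF-8', $bytes,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text;
}

my $good = "\xC4\x85\xC4\x99";   # UTF-8 bytes of U+0105 U+0119
my $bad  = "6 01/02 \xFF%";      # 0xFF can never appear in UTF-8

printf "good: %s\n", defined(decode_strict($good)) ? "valid" : "malformed";
printf "bad:  %s\n", defined(decode_strict($bad))  ? "valid" : "malformed";
```

Decoding up front like this pays the validation cost once, at input time, rather than leaving it to be discovered later by whichever operation happens to look closely at the string.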
>
> We also don't go out of our way to make validity checks as we execute.
> Those checks are only done if the result somehow depends on them. If we
> can, for example, fail a match without needing to know the UTF-8
> validity of the target string, we do so, without slowing down everything
> while we check, perhaps for the umpteenth time, that the string is valid.
>
> That is what is happening here, as you can see if you add -Dr. As an
> aside, that is the first thing an experienced perl programmer should do
> when thinking there is a regex bug.
>
> In the first case, you get this:
> UTF-8 pattern and string...
> Intuit: trying to determine minimum start position...
> doing 'check' fbm scan, [0..146] gave -1
>
> Did not find anchored substr "%x{105}%x{119}"...
> Match rejected by optimizer
>
> In this case, we can tell that the match will fail because we first use
> a Fast Boyer-Moore scan for the 4-byte sequence that comprises the needed
> string. It wasn't there, so there is no need to look in more detail.
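
To illustrate why a fixed-string pattern can be rejected without ever noticing malformed input, here is a rough sketch using plain `index()` on byte strings. This is not perl's actual optimizer code, and the haystack is an invented stand-in for the reporter's data; it only shows that a byte-level substring search never has to interpret the bytes as characters.

```perl
use strict;
use warnings;

# Raw, undecoded bytes. The needle is the UTF-8 encoding of U+0105 U+0119.
my $needle   = "\xC4\x85\xC4\x99";
my $haystack = "6 01/02 \xFF\xFF\xFF%";   # contains malformed bytes

# index() compares bytes here, so the 0xFF bytes are never decoded.
my $at = index($haystack, $needle);
print $at < 0 ? "rejected without decoding\n" : "candidate at byte $at\n";
```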
>
> The second case is different.
> I get (with my sanitized input)
> UTF-8 pattern and string...
> Matching stclass ANYOF[0105 0119] against " 6 01/02 %"
> %x{fffd}%x{fffd}%x{fffd}"... (146 bytes)
> Contradicts stclass... [regexec_flags]
> Match failed
>
> In this case we don't do a byte scan, but have to examine the string in
> detail, and during that discover that it is malformed.
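
By contrast, matching a character class means looking at the target one character at a time, and that is where malformed bytes get noticed. The sketch below is a deliberately simplified stand-in for that kind of scan, not perl's regexec code: `find_class_char` is a hypothetical helper, the inputs are invented, and it does not check for overlong forms or surrogates.

```perl
use strict;
use warnings;

# Hypothetical helper, far simpler than perl's own code: return the byte
# offset of the first character whose code point is in @class, or -1 if
# there is none. Dies on malformed input. Overlongs and surrogates are
# NOT checked.
sub find_class_char {
    my ($bytes, @class) = @_;
    my %want = map { $_ => 1 } @class;
    my $len  = length($bytes);
    my $pos  = 0;
    while ($pos < $len) {
        my $lead = ord(substr($bytes, $pos, 1));
        my ($extra, $cp);
        if    ($lead < 0x80)                   { ($extra, $cp) = (0, $lead) }
        elsif ($lead >= 0xC2 && $lead <= 0xDF) { ($extra, $cp) = (1, $lead & 0x1F) }
        elsif ($lead >= 0xE0 && $lead <= 0xEF) { ($extra, $cp) = (2, $lead & 0x0F) }
        elsif ($lead >= 0xF0 && $lead <= 0xF4) { ($extra, $cp) = (3, $lead & 0x07) }
        else  { die "Malformed UTF-8 at byte $pos\n" }

        # Each continuation byte must exist and look like 10xxxxxx.
        for my $i (1 .. $extra) {
            my $offset = $pos + $i;
            die "Malformed UTF-8 at byte $pos\n" if $offset >= $len;
            my $byte = ord(substr($bytes, $offset, 1));
            die "Malformed UTF-8 at byte $pos\n" if ($byte & 0xC0) != 0x80;
            $cp = ($cp << 6) | ($byte & 0x3F);
        }
        return $pos if $want{$cp};
        $pos += 1 + $extra;
    }
    return -1;
}

my $hit = find_class_char("ab\xC4\x99", 0x105, 0x119);
my $err = eval { find_class_char("6 01/02 \xFF%", 0x105, 0x119); 1 } ? '' : $@;
print "hit at byte $hit\n";
print "error: $err";
```

The scan has to compute each code point to test class membership, so it cannot avoid tripping over the invalid byte the way a plain substring search can.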
>
> The fix for this is to fix :utf8 to do validity checking by default.
> We're not going to cripple perl's performance by adding validity checks
> where the outcome doesn't depend on validity. And we're not going to
> make the code more complex by deciding, here we may be able to ignore
> that it's invalid, and press on. To prevent segfaults and stuff, we
> have to refuse to handle invalid utf-8 when it matters.
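
For comparison, the `:encoding(UTF-8)` layer already checks its input today. A small sketch, assuming an in-memory file with invented contents: with the default settings I would expect the invalid byte to produce a warning and be replaced by an escape rather than passed through, but the exact warning text and replacement are not asserted here.

```perl
use strict;
use warnings;

# U+0105, then some text, then an invalid 0xFF byte.
my $bytes = "\xC4\x85 ok \xFF\n";

# Collect warnings instead of printing them, so the output stays tidy.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# Read from an in-memory file through the checking layer.
open my $fh, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";
my $line = <$fh>;
close $fh;

printf "first char is U+%04X\n", ord($line);
printf "string is %s\n", utf8::valid($line) ? "well-formed" : "malformed";
printf "warnings raised: %d\n", scalar @warnings;
```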
>
I believe it's documented somewhere that you can have inconsistent
results with invalid UTF-8 input.