Re: Unicode regex negated case-insensitivity in 5.14.0-RC1
From:
Karl Williamson
Date:
April 28, 2011 16:34
Subject:
Re: Unicode regex negated case-insensitivity in 5.14.0-RC1
Message ID:
4DB9F910.7000108@khwilliamson.com
On 04/28/2011 05:27 PM, Karl Williamson wrote:
> On 04/28/2011 02:59 PM, Tom Christiansen wrote:
>> It's even weirder than that. Given:
>>
>> $\ = "\n";
>> my $x = "X-Xoqp-SDR-FpCqar4-Duooery-Faad-laeC_cCesspfpads:";
>> print $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/ || 0;
>> print $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/i || 0;
>> utf8::upgrade($x);
>> print $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/ || 0;
>> print $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/i || 0;
>>
>> Here are the results:
>>
>> % perl5.12.3 /tmp/bl
>> 1
>> 1
>> 1
>> 1
>>
>> % perl5.12.3 -M5.012 /tmp/bl
>> 1
>> 1
>> 1
>> 1
>>
>> % blead /tmp/bl
>> 1
>> 1
>> 1
>> 0
>>
>> % blead -M5.012 /tmp/bl
>> 1
>> 0
>> 1
>> 0
>>
>> So with (full) Unicode strings, it's yet again different still.
>>
>> With apologies to Philip K Dick :), this is a Karl-Thing, I think.
>>
>> --tom
>>
>
> Fortunately for my ego, the problem isn't in my code.
>
> Unfortunately for the project (and perhaps for my ego), the problem is
> much deeper; it is an issue with multi-character folds. The reason this
> doesn't match in 5.14 when full Unicode semantics are on (with or without
> utf8ness) is that in 5.14, for the first time, multi-char folds work.
>
> At least they work as designed. Perhaps there is a better design that
> wouldn't have the gotcha this gives. I don't know, and am open to
> suggestions. But Unicode is proposing to stop recommending that regular
> expression engines accept them. This proposal stemmed, at least in
> part, from my pointing out issues to them about the feature. But their
> proposal wasn't mounted until around the feature freeze time of 5.14,
> after I had coded to what I thought were the correct specifications; and
> the comment period for the proposal is still going -- it ends this
> weekend. If you'd like to comment, see the document at
> http://unicode.org/reports/tr18/proposed.html
> After comments are over, they have to be evaluated, and will be
> presented to May's meeting of the Unicode Technical Committee, and who
> knows what will happen then.
>
> Let me quote from part of the motivation for the changes, found at
> http://unicode.org/review/pri179
>
> "There are a number of examples where the results would be
> counter-intuitive for typical users of regular expressions."
>
> I think by "typical users" they mean anyone who doesn't have the mind of
> a CPU. :)
>
> Anyway, what's going on here is that the regex appears to have been
> designed to match the graphic ASCII characters except the colon. But it
> is written so as to match the complement of the complement of those
> characters, with case-insensitivity thrown in. That means it is supposed
> to not match our old friend the German sharp ss, "ß". But that means it
> is not supposed to match the case fold of ß because we have /i matching.
> And that means it isn't supposed to match the sequence 'ss', which is
> the case fold of ß. And that means the match fails at the point in the
> above string where there is 'ss' in a row.
>
> That is counter-intuitive to me, but it is correct with the implemented
> regex rules, and it seems to me to be correct according to what the
> current Unicode TR18 says. Is there disagreement?
>
> What to do? I think this is a 5.14 blocker. And I'm thankful George
> found it now and not later. I wish I had a really good idea of how to
> proceed. My proposal, unless a better idea surfaces, is to just disable
> multi-character folds in regex matching for 5.14, which is the direction
> that Unicode appears to be moving. Multi-character folding worked
> somewhat in earlier releases, but was extremely buggy, and could not be
> relied on. Thus there are some backward compatibility issues, but we
> might have to do that anyway if Unicode proceeds as expected.
> It's actually quite easy to change the code to do this, as almost
> everything is driven off a mktables generated table. We just have to
> change a line or two in mktables to ignore the multi-char folds. There's
> also a line or two in regcomp, as the fold for ß has a special case
> there (for performance, to avoid having to look at the tables for
> non-utf8 patterns). There's plenty of code that just won't get
> exercised, which could be #ifdef'd out, but that's not really necessary.
> The bigger amount of work is fixing the .t's.

And fixing the .pod's
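
The chain of reasoning above (ß is excluded by the class, ß folds to 'ss', so 'ss' is excluded under /i) can be checked piece by piece. What follows is only a minimal sketch, not code from perl itself: it assumes a perl with the fc() function (5.16 or later, so not the 5.14 release candidate under discussion), and the variable names are made up for illustration. It shows the fold of U+00DF and where that fold occurs in the test string; it says nothing about how any particular perl's regex engine treats the negated class.

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);   # fc() is assumed available (5.16+)

# U+00DF, LATIN SMALL LETTER SHARP S; it lies inside the excluded
# range \x7f-\xff of the character class in the report.
my $sharp_s = "\x{DF}";
utf8::upgrade($sharp_s);              # make sure Unicode rules apply

# Full Unicode case folding turns the single character into two.
my $folded = fc($sharp_s);
printf "fold of U+00DF is '%s' (%d characters)\n", $folded, length $folded;

# The string from the report; find where that fold occurs in it.
my $header = "X-Xoqp-SDR-FpCqar4-Duooery-Faad-laeC_cCesspfpads:";
my $pos    = index(fc($header), $folded);
print "fold sequence found at offset $pos\n";
```

If the class is taken to exclude everything that folds to an excluded character, that offset is where a case-insensitive match of the negated class would stop.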