perl.perl5.porters | Postings from April 2011

Re: Unicode regex negated case-insensitivity in 5.14.0-RC1

From: Karl Williamson
Date: April 28, 2011 18:49
Subject: Re: Unicode regex negated case-insensitivity in 5.14.0-RC1
Message ID: 4DBA18C0.8070301@khwilliamson.com

On 04/28/2011 06:32 PM, Tom Christiansen wrote:
>> Wouldn't backing out multichar folds for 5.14 introduce a regression?
>
> Specifically, it would break things like this, which already worked:
>
>      % perl5.12.0 -E 'say "\x{FB00}" =~ /ff/i || 0'
>      1
> 	...
>      % perl5.12.3 -E 'say "\x{FB00}" =~ /ff/i || 0'
>      1
>
> --tom
>

Yes, it would.  My point was that that appears to be where Unicode is
headed, but there are no guarantees that that is where they'll end up.

A middle position would be to disable them only in bracketed character 
classes.  I think that the most astonishment stems from those, when they 
are inverted.  This is where it was most buggy pre-5.14.  There were 
cases where it worked; but mostly it didn't.  And most of the cases 
where it worked were when the class got optimized into an EXACTF node. 
We'd have to worry about what to do with that situation now.  My 
position would be that we wouldn't do that optimization if the result 
would match multiple characters.

To state more clearly, I guess I'm now putting forth the idea that the 
least worst case for 5.14 is that we say that a bracketed character 
class can only match a single input character.  Most people expect that 
anyway, and it would have the fewest regressions.  Almost all 
regressions would be of the form that /[ß]/i would no longer mean the 
same thing as /ß/i.
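
To make the comparison concrete, here is a minimal sketch that probes the
cases discussed above.  It is not taken from the perl test suite: the
matches() helper and the case labels are hypothetical names introduced for
illustration, and the /u modifier is assumed to be available (5.14 or
later).  The results for the bracketed forms depend on which perl runs the
script, so no expected output is shown.

```perl
use strict;
use warnings;
use 5.014;    # assumption: /u (Unicode rules) needs 5.14 or later

# Hypothetical helper: return 1 if $string matches $regex, otherwise 0.
sub matches {
    my ($string, $regex) = @_;
    return $string =~ $regex ? 1 : 0;
}

# Each case is [ label, target string, compiled pattern ].
# \x{FB00} is LATIN SMALL LIGATURE FF; \x{DF} is LATIN SMALL LETTER SHARP S.
my @cases = (
    [ '"\x{FB00}" =~ /ff/i',     "\x{FB00}", qr/ff/i           ],  # multi-char fold, no class
    [ '"ss" =~ /^\x{DF}$/iu',    'ss',       qr/^\x{DF}$/iu    ],  # bare sharp s
    [ '"ss" =~ /^[\x{DF}]$/iu',  'ss',       qr/^[\x{DF}]$/iu  ],  # bracketed class
    [ '"ss" =~ /^[^\x{DF}]$/iu', 'ss',       qr/^[^\x{DF}]$/iu ],  # inverted class
);

# Print one line per case so different perl versions can be compared.
for my $case (@cases) {
    my ($label, $string, $regex) = @$case;
    printf "%-28s => %d\n", $label, matches($string, $regex);
}
```

Running the same script under several perl builds and comparing the second
and third lines shows whether /[ß]/i and /ß/i agree on a given version.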

The idea of allowing a non-inverted class to match multiple-char folds,
but not an inverted one, scares me.
