
Re: Unicode regex negated case-insensitivity in 5.14.0-RC1

From: Tom Christiansen
Date: April 28, 2011 19:24
Subject: Re: Unicode regex negated case-insensitivity in 5.14.0-RC1
Message ID: 6165.1304043818@chthon

Karl Williamson <public@khwilliamson.com> wrote
   on Thu, 28 Apr 2011 19:47:44 MDT:

> Yes it would.  My point was that appears to be where Unicode is headed. 
> But there are no guarantees that that is where they'll end up.

I'm not entirely sure that they should, either.

> A middle position would be to disable them only in bracketed character 
> classes.  I think that the most astonishment stems from those, when they 
> are inverted.  This is where it was most buggy pre-5.14.  There were 
> cases where it worked; but mostly it didn't.  And most of the cases 
> where it worked were when the class got optimized into an EXACTF node. 
> We'd have to worry about what to do with that situation now.  My 
> position would be that we wouldn't do that optimization if the result 
> would match multiple characters.

> To state more clearly, I guess I'm now putting forth the idea that the 
> least worst case for 5.14 is that we say that a bracketed character 
> class can only match a single input character.  Most people expect that 
> anyway, and it would have the fewest regressions.  Almost all 
> regressions would be of the form that /[ß]/i would no longer mean the 
> same thing as /ß/i.

> The idea scares me of allowing a non-inverted class match multiple char 
> folds vs an inverted one

I have always been bugged by the idea that a bracketed character class
could ever match more than a single code point.  It's like /./ suddenly
matching more than one, even though you're not in grapheme mode.  Character
classes seem to be inherently singletons.

It's because of this that we can't do certain kinds of lookbehinds anymore:

    % blead -E 'say "psst" =~ /(?<=[\x80-\xFF])t/ || 0'
    0

    % blead -E 'say "psst" =~ /(?<=[\x80-\xFF])t/i || 0'
    Variable length lookbehind not implemented in regex m/(?<=[\x80-\xFF])t/ at -e line 1.
    Exit 255

    % blead -E 'say "psst" =~ /(?<=[^\x80-\xFF])t/iaa || 0'
    1

And it's not the character class that's doing it, either; it's this:

    % blead -E 'say "psst" =~ /(?<=\xDF)t/ || 0'
    0

    % blead -E 'say "psst" =~ /(?<=\xDF)t/i || 0'
    Variable length lookbehind not implemented in regex m/(?<=\xdf)t/ at -e line 1.
    Exit 255

    % blead -E 'say "psst" =~ /(?<=\xDF)t/iaa || 0'
    0

So this is already a weirdness even without bringing a
character class into it, whether it's inverted or not.

That one at least has a possible fix.  You turn something like 

    (?<=\R)

into

    (?:(?<=\r\n)|(?<=\v))

just as you turn

    /(?<=\xDF)/i

into

    /(?:(?i:(?<=ss))|(?-i:(?<=\xDF)))/i
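
A minimal sketch of that hand expansion, written so that each lookbehind
has a single fixed width.  The helper name t_follows_sharp_s is made up
for illustration, and the /aa on the "ss" branch is my assumption: it
keeps ASCII "s" from folding with any non-ASCII character, so that branch
stays exactly two characters wide.  This is not how the regex compiler
does or would do it, only what the rewrite looks like by hand.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): true if some "t" in $str is
# preceded by text that case-folds to U+00DF (sharp s).
sub t_follows_sharp_s {
    my ($str) = @_;
    # Branch 1: the two-character fold "ss" in any case; /aa forbids
    #           ASCII/non-ASCII folds, so the lookbehind width is fixed at 2.
    # Branch 2: the literal one-character U+00DF, matched without /i.
    return $str =~ /(?:(?iaa:(?<=ss))|(?<=\xDF))t/ ? 1 : 0;
}

print t_follows_sharp_s("psst"),      "\n";   # "ss" branch
print t_follows_sharp_s("PSSt"),      "\n";   # "ss" branch, upper case
print t_follows_sharp_s("Stra\xDFt"), "\n";   # literal U+00DF branch
print t_follows_sharp_s("pst"),       "\n";   # neither branch
```

Splitting the alternation outside the lookbehinds is the point: neither
branch ever asks the engine to look back a variable distance.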

Perhaps I've strayed a bit from the matter at hand; I could certainly live
with no multichar folds in charclasses (positive or negative alike), but
multichar folds are still a bit of a curiosity, charitably put.  Even so,
I do not think we can dare consider breaking:

    % perl5.8.1 -le 'print "\x{FB00}" =~ /ff/i || 0'
    1

And I don't know why the Unicode folks might want (us) to at this point.

--tom