
Re: Unicode regex negated case-insensitivity in 5.14.0-RC1

From: Karl Williamson
Date: April 30, 2011 16:30
Subject: Re: Unicode regex negated case-insensitivity in 5.14.0-RC1
Message ID: 4DBC9B11.5070000@khwilliamson.com

On 04/30/2011 04:00 PM, Nicholas Clark wrote:
> On Sat, Apr 30, 2011 at 03:31:23PM -0600, Tom Christiansen wrote:
>> Karl Williamson<public@khwilliamson.com>  wrote
>>     on Sat, 30 Apr 2011 15:20:08 MDT:
>>
>>> In thinking about this some more, given the bug that Nicholas found that
>>> affects all multi-character folds, not just \xdf,  in character classes,
>                                                         ^^^^^^^^^^^^^^^^^^^^
>
>>> I think it would be best to just not offer any of them in 5.14.
>>
>> You mean undo something that's been there since 5.8?
>>
>>      % perl5.8.0 -le 'print "\x{1FB2}" =~ /\x{1FB2}/i || 0'
>>      1
>>      % perl5.8.0 -le 'print ucfirst("\x{1FB2}") =~ /\x{1FB2}/i || 0'
>>      1
>>      % perl5.8.0 -le 'print uc("\x{1FB2}") =~ /\x{1FB2}/i || 0'
>>      1
>
>> Or did you mean something else?
>
> I think you missed the "in character classes" part of Karl's thought.
> Your examples don't use []


Precisely.  So those examples would not be broken; only bracketed 
character classes in regular expressions would be.  You said a couple of days ago 
that "I have always been bugged by the idea that a bracketed character 
class could ever match more than a single code point.  It's like /./ 
suddenly matching more than one, but you're not in grapheme mode. 
Character classes seem to be inherent singletons."  So it appeared that 
you agreed with me.

Multi-char folds in bracketed character classes did not work in 5.10, 
and I presume earlier, though Yves was surprised at the time that they 
didn't.  I'm the one who filed a trouble ticket on the issue, and put in 
what turned out to be a very partial fix for 5.10.1.  They still don't 
work right in 5.14, given the flaw that Nicholas found.  (That could be 
fixed for some dot release.)

>
> I'm still not sure *what* I think.
>
> But *if* a class consisting of a single character is always equivalent to a
> literal of that character (ie /[a]/ is /a/, /[ß]/ is /ß/, /[ß]/i is /ß/i,
> etc), one of the things I'm not sure about is whether it's better to say "no
> multi character folds in character classes" or "no multi character folds in
> character classes, except classes consisting of exactly one character". I
> think (I think) that it's useful to maintain that explicit correspondence,
> as (IIRC) Yves worked to get the engine to optimise /[a]/ to /a/ and /[.]/ to
> /\./, as it was a common idiom in some circles to use regexp character class
> syntax as an alternative to backslash quoting.
>
> The downside, obviously, is that (for starters) it's more complex to explain.

I just realized that this is mostly a red herring.  I think it was me 
who brought it up, and I apologize.  Only Latin1 code points have ever 
been optimized this way.  The only Latin1 code point that has a multi 
character fold is ß.  In 5.12, a /[ß]/i was optimized into an EXACTF 
node.  But this is one of the tricky folds, which fails Ilya's optimizer 
tests.  Thus almost certainly /[ß]/i would not work in 5.12.  Therefore 
we are not introducing a regression if we don't have it work in 5.14.
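
One way to check which node a given class compiles to is to have perl
dump the compiled program.  What follows is only a minimal sketch,
assuming a Unix-like shell: the helper name regex_program and the
environment variable REGEX_UNDER_TEST are made up for illustration,
and the node names printed differ from one perl version to another.

```perl
use strict;
use warnings;

# Hypothetical helper: compile $pattern in a child perl running with
# -Mre=debug and return the compiler's dump, which goes to STDERR.
sub regex_program {
    my ($pattern) = @_;
    # Hand the pattern over via the environment to avoid shell quoting issues.
    local $ENV{REGEX_UNDER_TEST} = $pattern;
    return scalar qx{"$^X" -Mre=debug -e 'qr/\$ENV{REGEX_UNDER_TEST}/' 2>&1};
}

# An inline (?i) stands in for a trailing /i flag.
for my $pattern ('[a]', '(?i)[\xdf]', '[Bb]', '[Kk]') {
    print "=== $pattern ===\n", regex_program($pattern), "\n";
}
```

Comparing the "Final program" section for each pattern shows whether
the class survived as a class node or was turned into an exact-match
node.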

The other single-character classes whose character has a multi-char 
fold are non-Latin1 and have never been optimized.  Thus, they didn't 
work in 5.10 (and I presume earlier), and worked only under rare 
circumstances through 5.12 (I don't know now what those circumstances are).

% perl5.12.2 -E 'say "fi" =~ /[\N{U+FB01}]/i || 0'
0

So we aren't introducing much of a regression if we don't have these 
work in 5.14.  So the single code point vs multiple code point issue is 
not an issue.
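
To compare the literal form with the single-character class form on
whatever perl is at hand, something like the following could be used.
It is a sketch, not a definitive test: the helper name fold_matches is
made up, the fold sequences are the ones listed in Unicode's
CaseFolding.txt, and no expected output is shown because the class
column depends on the perl version.

```perl
use strict;
use warnings;

# Each entry: label, a single code point, and its full case fold
# (fold sequences as listed in Unicode's CaseFolding.txt).
my @cases = (
    [ 'U+00DF', "\x{DF}",   "ss" ],
    [ 'U+FB01', "\x{FB01}", "fi" ],
    [ 'U+1FB2', "\x{1FB2}", "\x{1F70}\x{3B9}" ],
);

# Returns (literal, class): whether $fold matches $char caselessly when
# $char is a bare literal and when it is the sole member of a class.
sub fold_matches {
    my ($char, $fold) = @_;
    # Upgrade both so Unicode rules apply even to the Latin1 case.
    utf8::upgrade($char);
    utf8::upgrade($fold);
    my $literal = $fold =~ /\A$char\z/i   ? 1 : 0;
    my $class   = $fold =~ /\A[$char]\z/i ? 1 : 0;
    return ($literal, $class);
}

for my $case (@cases) {
    my ($label, $char, $fold) = @$case;
    my ($literal, $class) = fold_matches($char, $fold);
    print "$label  literal: $literal  class: $class\n";
}
```

Any row where the two columns differ is a case where /[X]/i and /X/i
disagree on that perl.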

>
> Digression:
>
> Because as a general rule, rightly or wrongly on my part

I think it is rightly.

> , I feel that it's
> unfortunate if two or more different syntax choices for the same action
> produce notably different performance because they trigger different
> runtime implementations, where both
>
> a: one is unambiguously always slower than the other
> b: it would be possible for the compile time implementation to automatically
>     select the faster implementation, whichever syntax was used
>
>
> because that way
>
> a: all existing code goes faster without change
> b: it kills dead style arguments based on "but this one is more efficient"
>     letting people pick style based on clarity (or their opinions of clarity)
>
>
> (eg reverse sort ...; is now internally optimised to tell sort to sort in
> reverse, so no slower than sort {$b cmp $a} ...; but usually somewhat clearer)
>
>
> Nicholas Clark
>
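
For reference, the two spellings in Nicholas's example give the same
result; only the implementation underneath differs.  A small sketch
with made-up sample data:

```perl
use strict;
use warnings;

my @words = qw(pear Apple fig apple banana);

my @explicit = sort { $b cmp $a } @words;   # comparator spells out the order
my @reversed = reverse sort @words;         # default sort, then reverse

print "explicit: @explicit\n";
print "reversed: @reversed\n";
print "same order: ", ("@explicit" eq "@reversed" ? "yes" : "no"), "\n";
```

Both lists come out as "pear fig banana apple Apple", so the choice
between the two forms can be made on clarity alone.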

Another digression: in 5.14, I added the optimization that classes of 
the form [Bb], with exactly two Latin1 code points that are folds of 
each other, get optimized into EXACTFish nodes.  This isn't the case 
for [Kk], because the Kelvin sign (U+212A) also folds to k, so the 
class is not equivalent to a case-insensitive k.
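
The difference between the two cases can be seen by listing which code
points each form matches.  This is a minimal sketch: the helper name
matching_code_points is made up, and the scan stops at U+2FFF only to
keep it quick while still covering the Kelvin sign.

```perl
use strict;
use warnings;

# Hypothetical helper: list the code points in U+0000..U+2FFF whose
# one-character string matches the given compiled pattern.
sub matching_code_points {
    my ($re) = @_;
    return join ' ', map  { sprintf 'U+%04X', $_ }
                     grep { chr($_) =~ $re } 0 .. 0x2FFF;
}

print '[Bb]  matches: ', matching_code_points(qr/\A[Bb]\z/), "\n";
print '/b/i  matches: ', matching_code_points(qr/\Ab\z/i),   "\n";
print '[Kk]  matches: ', matching_code_points(qr/\A[Kk]\z/), "\n";
print '/k/i  matches: ', matching_code_points(qr/\Ak\z/i),   "\n";

# Prints:
# [Bb]  matches: U+0042 U+0062
# /b/i  matches: U+0042 U+0062
# [Kk]  matches: U+004B U+006B
# /k/i  matches: U+004B U+006B U+212A
```

[Bb] and /b/i accept the same set, so one can stand in for the other;
[Kk] and /k/i do not, because /k/i also accepts U+212A.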

