On 04/30/2011 04:00 PM, Nicholas Clark wrote:
> On Sat, Apr 30, 2011 at 03:31:23PM -0600, Tom Christiansen wrote:
>> Karl Williamson<public@khwilliamson.com> wrote
>> on Sat, 30 Apr 2011 15:20:08 MDT:
>>
>>> In thinking about this some more, given the bug that Nicholas found that
>>> affects all multi-character folds, not just \xdf, in character classes,
> ^^^^^^^^^^^^^^^^^^^^
>
>>> I think it would be best to just not offer any of them in 5.14.
>>
>> You mean undo something that's been there since 5.8?
>>
>> % perl5.8.0 -le 'print "\x{1FB2}" =~ /\x{1FB2}/i || 0'
>> 1
>> % perl5.8.0 -le 'print ucfirst("\x{1FB2}") =~ /\x{1FB2}/i || 0'
>> 1
>> % perl5.8.0 -le 'print uc("\x{1FB2}") =~ /\x{1FB2}/i || 0'
>> 1
>
>> Or did you mean something else?
>
> I think you missed the "in character classes" part of Karl's thought.
> Your examples don't use []

Precisely. So those examples would not be broken; only bracketed
character classes in regular expressions would be affected. You said a
couple of days ago that "I have always been bugged by the idea that a
bracketed character class could ever match more than a single code
point. It's like /./ suddenly matching more than one, but you're not in
grapheme mode. Character classes seem to be inherent singletons." So it
appeared that you agreed with me.

Multi-char folds in bracketed character classes did not work in 5.10,
and I presume earlier, though Yves was surprised at the time that they
didn't. I'm the one who filed a trouble ticket on the issue, and put in
what turned out to be a very partial fix for 5.10.1. They still don't
work right in 5.14, given the flaw that Nicholas found. (That could be
fixed in some dot release.)

>
> I'm still not sure *what* I think.
>
> But *if* a class consisting of a single character is always equivalent to a
> literal of that character (ie /[a]/ is /a/, /[ß]/ is /ß/, /[ß]/i is /ß/i,
> etc), one of the things I'm not sure about is whether it's better to say "no
> multi character folds in character classes" or "no multi character folds in
> character classes, except classes consisting of exactly one character". I
> think (I think) that it's useful to maintain that explicit correspondence,
> as (IIRC) Yves worked to get the engine to optimise /[a]/ to /a/ and /[.]/ to
> /\./, as it was a common idiom in some circles to use regexp character class
> syntax as an alternative to backslash quoting.
>
> The downside, obviously, is that (for starters) it's more complex to explain.

I just realized that this is mostly a red herring. I think it was me
who brought it up, and I apologize. Only Latin1 code points have ever
been optimized this way. The only Latin1 code point that has a
multi-character fold is ß. In 5.12, /[ß]/i was optimized into an EXACTF
node. But this is one of the tricky folds, which fails Ilya's optimizer
tests. Thus /[ß]/i almost certainly did not work in 5.12, and therefore
we are not introducing a regression if it doesn't work in 5.14.

The other single-character classes with multi-char folds involve
non-Latin1 code points and have never been optimized. Thus they didn't
work in 5.10 (and I presume earlier), and worked only under rare
circumstances through 5.12; I don't now know what those circumstances
are.

% perl5.12.2 -E 'say "fi" =~ /[\N{U+FB01}]/i || 0'
0

So we aren't introducing much of a regression if we don't have these
work in 5.14, and the single code point vs multiple code point question
is not an issue.
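
For anyone who wants to see which code points are involved, here is a
minimal sketch that asks Unicode::UCD for the full case fold of a code
point. The helper name full_fold is made up for illustration, and the
sketch assumes a Unicode::UCD recent enough that casefold() returns a
'full' key. It only inspects the Unicode data; it says nothing about
what the regex engine in any particular perl actually does.

```perl
use strict;
use warnings;
use Unicode::UCD 'casefold';

# Hypothetical helper: return the full case fold of a code point as a
# list of code points.  casefold() returns undef when the code point
# has no fold entry, in which case it folds to itself.
sub full_fold {
    my ($cp) = @_;
    my $fold = casefold($cp);
    return ($cp) unless defined $fold;
    return map { hex($_) } split / /, $fold->{full};
}

# A fold is "multi-character" when the list has more than one element.
for my $cp (0xDF, 0xFB01, 0x1FB2, ord 'a') {
    my @fold = full_fold($cp);
    printf "U+%04X folds to %d code point(s): %s\n",
        $cp, scalar @fold,
        join ' ', map { sprintf 'U+%04X', $_ } @fold;
}
```

The ß (U+00DF) entry is the only one below U+0100 that should come back
with more than one code point, which is why it is the only Latin1 case
that matters here.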
>
> Digression:
>
> Because as a general rule, rightly or wrongly on my part
I think it is rightly.
> , I feel that it's
> unfortunate if two or more different syntax choices for the same action
> produce notably different performance because they trigger different
> runtime implementations, where both
>
> a: one is unambiguously always slower than the other
> b: it would be possible for the compile time implementation to automatically
> select the faster implementation, whichever syntax was used
>
>
> because that way
>
> a: all existing code goes faster without change
> b: it kills dead style arguments based on "but this one is more efficient"
> letting people pick style based on clarity (or their opinions of clarity)
>
>
> (eg reverse sort ...; is now internally optimised to tell sort to sort in
> reverse, so no slower than sort {$b cmp $a} ...; but usually somewhat clearer)
>
>
> Nicholas Clark
>
Another digression: in 5.14, I added the optimization that classes of
the form [Bb] with exactly two Latin1 code points where the two are
folds of each other get optimized into EXACTFish nodes. This isn't the
case for [Kk] because of the Kelvin sign being part of the fold equation.
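
A small sketch of why [Kk] is different, assuming a reasonably modern
perl (the matches() helper is just for illustration). A case-insensitive
literal "k" also matches KELVIN SIGN (U+212A), but the class [Kk]
without /i must not, so the two are not interchangeable the way [Bb]
and /b/i are. It shows only the matching behaviour, not which regnode
the compiler actually picks.

```perl
use strict;
use warnings;

my $kelvin = "\x{212A}";    # KELVIN SIGN, whose case fold is "k"

# Illustrative helper: 1 if the string matches the pattern, else 0.
sub matches {
    my ($str, $re) = @_;
    return $str =~ $re ? 1 : 0;
}

# [Kk] lists exactly two code points, so the Kelvin sign is excluded,
# while /k/i folds and therefore includes it.
printf "KELVIN SIGN =~ /[Kk]/ : %d\n", matches($kelvin, qr/[Kk]/);
printf "KELVIN SIGN =~ /k/i   : %d\n", matches($kelvin, qr/k/i);

# For B/b nothing else folds to "b", so the class and the folded
# literal accept the same single characters.
printf "B =~ /[Bb]/ : %d\n", matches('B', qr/[Bb]/);
printf "B =~ /b/i   : %d\n", matches('B', qr/b/i);
```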