Front page | perl.perl5.porters |
Postings from April 2010
Re: fold case matching
From: karl williamson
Date: April 27, 2010 10:43
Subject: Re: fold case matching
Message ID: 4BD7221D.2090305@khwilliamson.com
Dave Mitchell wrote:
> Just out of curiosity, which perl (if any) is doing the Right Thing
> as regards the following code, which matches a char that case folds to two
> chars:
>
> # lc("\x{149}") => "\x{2bc}N"
>
> print "ok PLAIN 1\n" if "\x{149}" =~ /\x{2bc}/i;
> print "ok PLAIN 2\n" if "\x{149}" =~ /N/i;
> print "ok PLAIN 3\n" if "\x{149}" =~ /\x{2bc}N/i;
>
> print "ok ALT 1\n" if "\x{149}" =~ /\x{2bc}|ZZZZ/i;
> print "ok ALT 2\n" if "\x{149}" =~ /N|ZZZZ/i;
> print "ok ALT 3\n" if "\x{149}" =~ /\x{2bc}N|ZZZZ/i;
>
>
> 5.8.0,
> 5.13.0,
> blead:
>
> ok PLAIN 3
> ok ALT 1
> ok ALT 3
>
> 5.10.0,
> 5.10.1,
> 5.12.0:
>
> ok PLAIN 1
> ok PLAIN 3
> ok ALT 1
> ok ALT 3
>
> (This is in the context of me trying to understand and fix the trie code
> for [perl #74484] Regex causing exponential runtime+mem usage.)
>
>
>
It appears to me that only case 3 should match, in both the PLAIN and
the ALT groups. The Unicode rule is simply that two strings match
case-insensitively iff fold($s1) eq fold($s2).
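
The rule above can be sketched in a few lines. This is only an illustration, not the regex engine's logic: it assumes a perl with the `fc` builtin (5.16 or later, which postdates this thread), and `fold_eq` is a hypothetical helper name. It compares whole strings, so it says nothing about unanchored matching or about what the trie code does.

```perl
use strict;
use warnings;
use feature 'fc';    # assumption: perl 5.16+, where fc() gives the full Unicode case fold

# Hypothetical helper: two strings match caselessly iff their folds are equal.
sub fold_eq {
    my ($s1, $s2) = @_;
    return fc($s1) eq fc($s2);
}

# The three patterns from the quoted test, compared against U+0149.
for my $pat ("\x{2bc}", "N", "\x{2bc}N") {
    my $shown = join ' ', map { sprintf 'U+%04X', ord } split //, $pat;
    printf "U+0149 vs %-14s %s\n", $shown,
        fold_eq("\x{149}", $pat) ? 'match' : 'no match';
}
```

Under this rule only the third pattern, \x{2bc}N, compares equal to \x{149}, because U+0149 folds to the two characters U+02BC followed by "n".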
My guess is that the improvement came from my very recent patch:

  commit 7dcb3b25fc4113f0eeb68d0d3c47ccedd5ff3f2a
  Author: Karl Williamson <khw@khw-desktop.(none)>
  Date: Tue Apr 13 21:25:36 2010 -0600

      * PATCH: [perl #72998] regex looping

which causes a partially matched character (as U+0149 is in this
instance) to not be a match. I don't know why it didn't fix the ALT 1
case, but see the next paragraph.
Let me say that the case insensitive matching in Perl of multi-char
folded characters is badly broken, and I'm sure always has been. It's
broken in many places. One big problem is that the optimizer doesn't
understand this possibility. Yves came up with the FOLDCHAR regnode
type to bypass the optimizer, but there are many more instances than it
addresses that cause problems. And it doesn't address the case where
the folded character is in the string to be matched, as opposed to being
in the pattern.
I started working on fixing the optimizer, but it was slow going. And I
stopped working on that when Yves sent a message that he was working on
a trie implementation of case insensitive matching. I had come to the
conclusion that band-aids were not going to fix this properly.
But, as I posted on this list not too long ago, there are semantic
issues with the concept, as in this example that Nicholas called 'evil':
> "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i;
>
> what would $1 and $2 be, and
> @LAST_MATCH_START, @LAST_MATCH_END?