perl.perl5.porters, postings from May 2010
Re: fold case matching
From: Dave Mitchell
Date: May 3, 2010 08:58
Subject: Re: fold case matching
Message ID: 20100503155809.GC26313@iabyn.com
On Tue, Apr 27, 2010 at 11:42:53AM -0600, karl williamson wrote:
> Dave Mitchell wrote:
>> Just out of curiosity, which perl (if any) is doing the Right Thing
>> as regards the following code, which matches a char that case folds to two
>> chars:
[snip]
> It appears to me that only case 3 should be matched in both instances.
> The Unicode rule is simply that two strings match case insensitively iff
> fold($s1) eq fold($s2).
>
> My guess is that the improvement came from my very recent patch:
> commit 7dcb3b25fc4113f0eeb68d0d3c47ccedd5ff3f2a
> Author: Karl Williamson <khw@khw-desktop.(none)>
> Date: Tue Apr 13 21:25:36 2010 -0600
>
> * PATCH: [perl #72998] regex looping
>
> which causes a partially matched character (as is U+0149 in this
> instance) to not be a match. I don't know why it didn't fix the ALT 1
> case, except read the next paragraph:
>
> Let me say that the case insensitive matching in Perl of multi-char
> folded characters is badly broken, and I'm sure always has been. It's
> broken in many places. One big problem is that the optimizer doesn't
> understand this possibility. Yves came up with the FOLDCHAR regnode
> type to bypass the optimizer, but there are many more instances than it
> addresses that cause problems. And it doesn't address the case where
> the folded character is in the string to be matched, as opposed to
> being in the pattern.
>
> I started working on fixing the optimizer, but it was slow going. And I
> stopped working on that when Yves sent a message that he was working on
> a trie implementation of case insensitive matching. I had come to the
> conclusion that band-aids were not going to fix this properly.
>
> But, as I posted on this list not too long ago, there are semantic
> issues with the concept, as in this example that Nicholas called 'evil':
>
> > "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i;
> >
> > what would $1 and $2 be, and
> > @LAST_MATCH_START, @LAST_MATCH_END?
Ah, a can of worms!
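To make the quoted rule concrete, here is a minimal sketch of fold($s1) eq fold($s2). It assumes perl 5.16 or later for the `fc` builtin, which is newer than the perls discussed in this thread, and `fold_eq` is a hypothetical helper name used only for illustration.

```perl
use strict;
use warnings;
use feature 'fc';    # assumption: perl 5.16 or later

# Hypothetical helper: two strings match case-insensitively
# iff their full case folds are equal.
sub fold_eq {
    my ($s1, $s2) = @_;
    return fc($s1) eq fc($s2) ? 1 : 0;
}

my $lig = "\x{FB00}";    # LATIN SMALL LIGATURE FF, a single character

print "ligature vs 'ff':       ", fold_eq($lig, "ff"), "\n";
print "U+0149 vs U+02BC + 'n': ", fold_eq("\x{149}", "\x{2BC}n"), "\n";

# The fold changes the length, which is where the capture question starts:
# one source character has to account for two pattern characters.
print "length before fold: ", length($lig),
      ", after fold: ", length(fc($lig)), "\n";
```

The comparison itself is simple. What it does not say is how offsets in the folded text map back to the original string, which is why $1 and $2 in the (f)(f) example have no obvious answer.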
I've noticed some other similar things, e.g.:
$ p -le 'print "matched" if "a\x{df}\x{100}" =~ /(aS|xx)S/i'
$ p -le 'print "matched" if "a\x{df}\x{100}" =~ /(a|xx)SS/i'
matched
$
It looks like the temporary fold buffer used within the TRIE code should
maybe be made available to the rest of the S_regmatch loop?
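One way to see why the two one-liners above look inconsistent is to fold the target string by hand first and then run both patterns against the folded text. This is only a sketch of the idea, not what S_regmatch does. It assumes perl 5.16 or later for `fc` (which postdates this thread), and the variable names are illustrative.

```perl
use strict;
use warnings;
use feature 'fc';    # assumption: perl 5.16 or later

my $str    = "a\x{df}\x{100}";
my $folded = fc($str);    # U+00DF folds to "ss", U+0100 folds to U+0101

# Same two patterns as in the one-liners, applied to the pre-folded text.
my $m1 = $folded =~ /(aS|xx)S/i ? 1 : 0;
my $m2 = $folded =~ /(a|xx)SS/i ? 1 : 0;

print "(aS|xx)S on folded text: ", ($m1 ? "matched" : "no match"), "\n";
print "(a|xx)SS on folded text: ", ($m2 ? "matched" : "no match"), "\n";
```

Against the pre-folded text both patterns match, because the "ss" produced by U+00DF is then two ordinary characters and the group boundary can fall between them. Whether the first pattern ought to match the unfolded string, where that boundary would split a single character, is the semantic question raised in the quoted text.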
Anyway, I decided while working on my trie fix (just committed), that I
would just leave folding as-is, and let someone else worry about it some
time!
--
The Enterprise is captured by a vastly superior alien intelligence which
does not put them on trial.
-- Things That Never Happen in "Star Trek" #10