perl.perl5.porters | Postings from April 2010

Re: fold case matching

karl williamson
April 27, 2010 10:43
Dave Mitchell wrote:
> Just out of curiosity, which perl (if any) is doing the Right Thing
> as regards the following code, which matches a char that case folds to two
> chars:
>     # lc("\x{149}") => "\x{2bc}N"
>     print "ok PLAIN 1\n" if "\x{149}" =~ /\x{2bc}/i;
>     print "ok PLAIN 2\n" if "\x{149}" =~ /N/i;
>     print "ok PLAIN 3\n" if "\x{149}" =~ /\x{2bc}N/i;
>     print "ok ALT   1\n" if "\x{149}" =~ /\x{2bc}|ZZZZ/i;
>     print "ok ALT   2\n" if "\x{149}" =~ /N|ZZZZ/i;
>     print "ok ALT   3\n" if "\x{149}" =~ /\x{2bc}N|ZZZZ/i;
> 5.8.0,
> 5.13.0,
> blead:
>     ok PLAIN 3
>     ok ALT   1
>     ok ALT   3
> 5.10.0,
> 5.10.1,
> 5.12.0:
>     ok PLAIN 1
>     ok PLAIN 3
>     ok ALT   1
>     ok ALT   3
> (This is in the context of me trying to understand and fix the trie code
> for [perl #74484] Regex causing exponential runtime+mem usage.)

It appears to me that only case 3 should match, in both the PLAIN and 
the ALT group.  The Unicode rule is simply that two strings match case 
insensitively iff fold($s1) eq fold($s2).
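
To make that rule concrete, here is a minimal sketch.  It assumes a 
perl new enough to have fc() (5.16 or later, so not any of the versions 
tested above), and fold_eq() is a made-up helper name, not anything in 
core.  It compares whole strings only, which is enough here because the 
target is a single character:

```perl
use strict;
use warnings;
use feature 'fc';    # fc() needs perl 5.16 or later

# Hypothetical helper: whole-string caseless comparison,
# i.e. fold($s1) eq fold($s2).
sub fold_eq {
    my ($s1, $s2) = @_;
    return fc($s1) eq fc($s2) ? 1 : 0;
}

my $target = "\x{149}";    # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE

# The three candidates from the PLAIN cases above.
for my $candidate ("\x{2bc}", "N", "\x{2bc}N") {
    my $label = join ' ', map { sprintf 'U+%04X', ord } split //, $candidate;
    printf "%-15s %s\n", $label,
        fold_eq($target, $candidate) ? 'equal under fold' : 'not equal';
}
```

Under this rule only the two-character candidate compares equal, which 
is why I say only case 3 should match.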

My guess is that the improvement came from my very recent patch:
  commit 7dcb3b25fc4113f0eeb68d0d3c47ccedd5ff3f2a
  Author: Karl Williamson <khw@khw-desktop.(none)>
  Date:   Tue Apr 13 21:25:36 2010 -0600

      PATCH: [perl #72998] regex looping

which causes a partially matched character (as U+0149 is in this 
instance) not to count as a match.  I don't know why it didn't fix the 
ALT 1 case, but see the next paragraph:

Let me say that the case insensitive matching in Perl of multi-char 
folded characters is badly broken, and I'm sure always has been.  It's 
broken in many places.  One big problem is that the optimizer doesn't 
understand this possibility.  Yves came up with the FOLDCHAR regnode 
type to bypass the optimizer, but there are many more instances than it 
addresses that cause problems.  And it doesn't address the case where 
the multi-char folding character is in the string to be matched, as 
opposed to being in the pattern.
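
The two directions can be probed separately.  What follows is only a 
sketch: the variable names are mine, fc() again assumes perl 5.16 or 
later, and I give no expected output for the two regex lines because 
what they print depends on the perl being run:

```perl
use strict;
use warnings;
use feature 'fc';    # fc() needs perl 5.16 or later

my $single = "\x{149}";     # one character whose fold is two characters
my $multi  = "\x{2bc}n";    # that two-character fold, written out

# The fold rule is symmetric, so it treats the two strings as equal.
my $rule = fc($single) eq fc($multi) ? 1 : 0;

# Direction 1: the multi-char folding character is in the pattern.
my $in_pattern = $multi =~ /\A\x{149}\z/i ? 1 : 0;

# Direction 2: the multi-char folding character is in the target string.
my $in_target = $single =~ /\A\x{2bc}n\z/i ? 1 : 0;

print "fold rule says equal:      ", ($rule       ? "yes" : "no"), "\n";
print "folding char in pattern:   ", ($in_pattern ? "matched" : "no match"), "\n";
print "folding char in target:    ", ($in_target  ? "matched" : "no match"), "\n";
```

Any line that disagrees with the first one is a case where the regex 
engine departs from the fold rule.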

I started working on fixing the optimizer, but it was slow going.  And I 
stopped working on that when Yves sent a message that he was working on 
a trie implementation of case insensitive matching.  I had come to the 
conclusion that band-aids were not going to fix this properly.

But, as I posted on this list not too long ago, there are semantic 
issues with the concept, as in this example that Nicholas called 'evil':

 > "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i;
 >  what would $1 and $2 be, and
