
Re: fold case matching

From: Dave Mitchell
Date: May 3, 2010 08:58
Subject: Re: fold case matching
Message ID: 20100503155809.GC26313@iabyn.com

On Tue, Apr 27, 2010 at 11:42:53AM -0600, karl williamson wrote:
> Dave Mitchell wrote:
>> Just out of curiosity, which perl (if any) is doing the Right Thing
>> as regards the following code, which matches a char that case folds to two
>> chars:
[snip]
> It appears to me that only case 3 should be matched in both instances.  
> The Unicode rule is simply that two strings match case insensitively iff  
> fold($s1) eq fold($s2).
>
> My guess is that the improvement came from my very recent patch:
>  commit 7dcb3b25fc4113f0eeb68d0d3c47ccedd5ff3f2a
>  Author: Karl Williamson <khw@khw-desktop.(none)>
>  Date:   Tue Apr 13 21:25:36 2010 -0600
>
> *    PATCH: [perl #72998] regex looping
>
> which causes a partially matched character (as is U+0149 in this  
> instance) to not be a match.  I don't know why it didn't fix the ALT 1  
> case, except read the next paragraph:
>
> Let me say that the case insensitive matching in Perl of multi-char  
> folded characters is badly broken, and I'm sure always has been.  It's  
> broken in many places.  One big problem is that the optimizer doesn't  
> understand this possibility.  Yves came up with the FOLDCHAR regnode  
> type to bypass the optimizer, but there are many more instances than it  
> addresses that cause problems.  And it doesn't address the case where  
> the folded character is in the string to be matched, as opposed to be in  
> the pattern.
>
> I started working on fixing the optimizer, but it was slow going.  And I  
> stopped working on that when Yves sent a message that he was working on  
> a trie implementation of case insensitive matching.  I had come to the  
> conclusion that band-aids were not going to fix this properly.
>
> But, as I posted on this list not too long ago, there are semantic  
> issues with the concept, as in this example that Nicholas called 'evil':
>
> > "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i;
> >
> >  what would $1 and $2 be, and
> > @LAST_MATCH_START, @LAST_MATCH_END?

Ah, a can of worms!

I've noticed some other similar things, e.g.:

    $ p -le 'print "matched" if "a\x{df}\x{100}" =~ /(aS|xx)S/i'
    $ p -le 'print "matched" if "a\x{df}\x{100}" =~ /(a|xx)SS/i'
    matched
    $

It looks like the temporary fold buffer used within the TRIE code should
perhaps be made available to the rest of the S_regmatch loop?

Anyway, I decided while working on my trie fix (just committed), that I
would just leave folding as-is, and let someone else worry about it some
time!

-- 
The Enterprise is captured by a vastly superior alien intelligence which
does not put them on trial.
    -- Things That Never Happen in "Star Trek" #10
