Re: Matching multi-character folds

From: karl williamson
Date: November 24, 2008 11:49
Subject: Re: Matching multi-character folds
Message ID: 492B0541.7000106@khwilliamson.com

karl williamson wrote:
> demerphq wrote:
> [snip]
>>>>
>>>> I personally consider character class notation to be an abbreviation
>>>> of alternation. So a character class [xyz] is supposed to match the
>>>> same thing as (x|y|z).  This implies that character classes have to be
>>>> able to match more than one character under case-folding rules.  A lot
>>>> of external logic and at least some internal logic operates under this
>>>> assumption, so I don't think we can change it.
>>>>
>>> That sounds right.
>>
>> I'm trying to imagine a way to do this that doesn't involve a pretty
>> considerable redesign of how character classes work, and not coming up
>> with much.
>>
>> Yves
>>
> 
> I've only time right now to address this last point in your response. 
> I'll look at the rest later.
> 
> What I know is that regcomp.c attempts to handle some of this.  Here is 
> a little of it, starting at line 8324:
>                   /* Any multicharacter foldings
>                    * require the following transform:
>                    * [ABCDEF] -> (?:[ABCabcDEFd]|pq|rst)
>                    * where E folds into "pq" and F folds
>                    * into "rst", all other characters
>                    * fold to single characters.  We save
>                    * away these multicharacter foldings,
>                    * to be later saved as part of the
>                    * additional "s" data. */
>                   SV *sv;
> 
>                   if (!unicode_alternate)
>                       unicode_alternate = newAV();
>                   sv = newSVpvn_utf8((char*)foldbuf, foldlen,
>                              TRUE);
>                   av_push(unicode_alternate, sv);
> 
> But it's not working.  I never found the time to pursue it.  But perhaps 
> you meant that it doesn't handle things like ß =~ /s{2}/
> 
> 
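The transform described in the regcomp.c comment quoted above can be
sketched in isolation.  This is only an illustration, not what
regcomp.c actually does: the struct class_member type and the
class_to_alternation() and append_bytes() functions are hypothetical
names, it handles single-byte (ASCII) class members only, and the
folds E -> "pq" and F -> "rst" are the placeholder folds from the
comment, not real Unicode folds.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one class member and what it folds to. */
struct class_member {
    char ch;            /* the character as written in the class */
    const char *fold;   /* its case fold; may be longer than one char */
};

/* Append len bytes of s to out, keeping out NUL-terminated.
 * Returns 0 on success, -1 if out is too small. */
static int append_bytes(char *out, size_t outsize, size_t *used,
                        const char *s, size_t len)
{
    if (*used + len + 1 > outsize)
        return -1;
    memcpy(out + *used, s, len);
    *used += len;
    out[*used] = '\0';
    return 0;
}

/* Build "(?:[singles]|multi1|multi2...)" from the members of a class.
 * Single-character folds stay inside the brackets; multi-character
 * folds become extra alternatives. */
static int class_to_alternation(const struct class_member *m, size_t n,
                                char *out, size_t outsize)
{
    size_t used = 0, i;

    if (append_bytes(out, outsize, &used, "(?:[", 4))
        return -1;
    for (i = 0; i < n; i++) {
        if (append_bytes(out, outsize, &used, &m[i].ch, 1))
            return -1;
        if (strlen(m[i].fold) == 1 && m[i].fold[0] != m[i].ch
            && append_bytes(out, outsize, &used, m[i].fold, 1))
            return -1;
    }
    if (append_bytes(out, outsize, &used, "]", 1))
        return -1;
    for (i = 0; i < n; i++) {
        size_t len = strlen(m[i].fold);
        if (len > 1
            && (append_bytes(out, outsize, &used, "|", 1)
                || append_bytes(out, outsize, &used, m[i].fold, len)))
            return -1;
    }
    return append_bytes(out, outsize, &used, ")", 1);
}
```

Given the six members A through F with the folds from the comment,
this produces a class holding the same set of characters as
[ABCabcDEFd] (in a different order), followed by the |pq|rst
alternatives.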
Here is another idea that might be helpful.  I looked up the discussion
of tricky folds in this list's archives.  Someone there suggested an
idea that I had also been thinking of independently, and it didn't look
like anyone responded to it.  In effect, instead of using trickyfold,
we would pretend that each tricky fold character in the input had been
written as a mapping of it.  For ß, for example, pretend it was
(?:ß|[Ss][Ss]|\x{1e9e}).  Then the optimizer wouldn't have to be fooled.
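
To make the idea concrete, here is a minimal sketch of such a rewrite
done as a plain text substitution on a UTF-8 pattern.  It is an
assumption-laden illustration, not a proposal for the real
implementation: the tricky_folds table and expand_tricky_folds() are
hypothetical names, the table has only the ß entry from the example
above, and the scan ignores backslash escapes, character classes and
quantifiers, which a real version inside the regex compiler would
have to respect.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table: a tricky fold character (as UTF-8 bytes) and the
 * alternation that stands in for it. */
struct tricky_fold {
    const char *ch;
    const char *expansion;
};

static const struct tricky_fold tricky_folds[] = {
    /* U+00DF LATIN SMALL LETTER SHARP S */
    { "\xC3\x9F", "(?:\xC3\x9F|[Ss][Ss]|\\x{1e9e})" },
};

/* Copy pattern into out, replacing each tricky fold character with its
 * expansion.  Returns 0 on success, -1 if out is too small. */
static int expand_tricky_folds(const char *pattern, char *out,
                               size_t outsize)
{
    size_t n_folds = sizeof tricky_folds / sizeof tricky_folds[0];
    size_t used = 0;

    while (*pattern) {
        const char *piece = pattern;  /* default: copy one byte as is */
        size_t piece_len = 1;
        size_t consumed = 1;
        size_t i;

        for (i = 0; i < n_folds; i++) {
            size_t len = strlen(tricky_folds[i].ch);
            if (strncmp(pattern, tricky_folds[i].ch, len) == 0) {
                piece = tricky_folds[i].expansion;
                piece_len = strlen(piece);
                consumed = len;
                break;
            }
        }
        /* Always leave room for the terminating NUL. */
        if (used + piece_len + 1 > outsize)
            return -1;
        memcpy(out + used, piece, piece_len);
        used += piece_len;
        pattern += consumed;
    }
    if (used + 1 > outsize)
        return -1;
    out[used] = '\0';
    return 0;
}
```

The expansion is written straight to the output and never rescanned,
so the ß inside the replacement text is not expanded a second time.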
