Re: Matching multi-character folds

karl williamson
November 24, 2008 11:20
demerphq wrote:
>>> I personally consider character class notation to be an abbreviation
>>> of alternation. So a character class [xyz] is supposed to match the
>>> same thing as (x|y|z).  This implies that character classes have to be
>>> able to match more than one character under case-folding rules.  A lot
>>> of external logic and at least some internal logic operates under this
>>> assumption, so I don't think we can change it.
>> That sounds right.
> I'm trying to imagine a way to do this that doesn't involve a pretty
> considerable redesign of how character classes work, and not coming up
> with much.
> Yves

I've only time right now to address this last point in your response. 
I'll look at the rest later.

What I know is that regcomp.c attempts to handle some of this.  Here 
is a little of it, starting at line 8324:
				  /* Any multicharacter foldings
				   * require the following transform:
				   * [ABCDEF] -> (?:[ABCabcDEFd]|pq|rst)
				   * where E folds into "pq" and F folds
				   * into "rst", all other characters
				   * fold to single characters.  We save
				   * away these multicharacter foldings,
				   * to be later saved as part of the
				   * additional "s" data. */
				  SV *sv;

				  if (!unicode_alternate)
				      unicode_alternate = newAV();
				  sv = newSVpvn_utf8((char*)foldbuf, foldlen, TRUE);
				  av_push(unicode_alternate, sv);

But it's not working, and I never found the time to pursue it.  Perhaps, 
though, you meant that it doesn't handle things like ß =~ /s{2}/ under 
case-insensitive matching.
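
To make the transform described in that regcomp.c comment concrete, here is a minimal standalone sketch in C. It is not the code in regcomp.c: the fold table, the struct multi_fold type, and the function names lookup_multi_fold and expand_class are all hypothetical, and the folds of 'E' to "pq" and 'F' to "rst" are just the placeholder letters from the comment. It also leaves out the single-character folds (the "abc...d" part of the comment's example) and only appends the multi-character alternatives.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical fold table, for illustration only: 'E' folds to "pq" and
 * 'F' folds to "rst", as in the comment quoted from regcomp.c.  Real
 * folds would come from the Unicode case-folding data. */
struct multi_fold {
    char from;
    const char *to;
};

static const struct multi_fold folds[] = {
    { 'E', "pq" },
    { 'F', "rst" },
};

/* Return the multi-character fold of c, or NULL if it has none. */
static const char *lookup_multi_fold(char c)
{
    size_t i;

    for (i = 0; i < sizeof folds / sizeof folds[0]; i++)
        if (folds[i].from == c)
            return folds[i].to;
    return NULL;
}

/* Write "(?:[members]|fold1|fold2...)" into out.  The class itself is
 * kept unchanged; each member with a multi-character fold contributes
 * one extra alternative.  Returns 0 on success, -1 if a buffer is too
 * small. */
static int expand_class(const char *members, char *out, size_t outsz)
{
    char alts[128] = "";
    size_t i;
    int n;

    assert(members != NULL && out != NULL);

    for (i = 0; members[i] != '\0'; i++) {
        const char *to = lookup_multi_fold(members[i]);

        if (to != NULL) {
            /* room for '|', the fold, and the trailing NUL */
            if (strlen(alts) + strlen(to) + 2 > sizeof alts)
                return -1;
            strcat(alts, "|");
            strcat(alts, to);
        }
    }

    n = snprintf(out, outsz, "(?:[%s]%s)", members, alts);
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

Calling expand_class("ABCDEF", buf, sizeof buf) with a 256-byte buf fills it with (?:[ABCDEF]|pq|rst).  The point of the sketch is only that the class stays a class and the long folds ride alongside it as alternatives; it also suggests why a quantified pattern such as /s{2}/ is a different problem, since there is no class there to attach the alternative to.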
