Re: another attempt at adding unicode regex support to perl

From: Jeff 'japhy' Pinyan
Date: May 29, 2003 00:26
Subject: Re: another attempt at adding unicode regex support to perl
Message ID: Pine.LNX.4.44.0305290318190.4803-100000@perlmonk.org

On May 29, Jeff 'japhy' Pinyan said:

>On May 27, Jeff 'japhy' Pinyan said:
>
>>I've been trying to add Unicode's regex charclass-magic to Perl (again),
>>and I think I've run into the same problem as before.
>
>I've hit a snag.  If a charclass has intersection or subtraction in it,
>and locale is on or \p{...} classes are used, the charclass must be
>represented ENTIRELY as those "+utf8::XXX" strings.  Here's why.
>
>If locale is on, then a charclass like [[\w&&[\d]][aeiou]] will have to be
>represented as "+utf8::IsAlnum &utf8::IsDigit +utf8::Is_aeiou_" (or
>something like that), because (since locale is on) \w doesn't modify the
>charclass's bitmap array, but just turns on the ANYOF_ALNUM flag.  Since
>precedence is an issue, we can't just check flags.
>
>This will mean we'll be suffering some inefficiency (but that should be
>expected with Unicode right now, right?).  It also means I need to come up
>with on-the-fly Unicode classes that match a specific set of characters I
>decide on at that moment.  What's the easiest way to do that?  I need to
>know this to get intersection and subtraction working.

I found another problem with how utf8_heavy.pl operates.  When it
evaluates whether a character matches a user-defined property
description, it handles the +utf8::xxx lines separately from the
0123\t0456 range lines.  Example:

  sub InJaphy {
    << "END";
  +utf8::IsAlnum
  -utf8::IsDigit
  0031\t0033
  END
  }

should match any alphanumeric that ISN'T a digit, plus 1, 2, and 3, which
the last line adds back.  It should match "abc123def" out of
"abc123def456ghi", but it doesn't, because the 0031\t0033 line is placed
in the swash first, and only THEN are the +utf8 and -utf8 lines parsed, so
-utf8::IsDigit removes 1, 2, and 3 again.  I couldn't find this ordering
defined anywhere in the docs.  I can fix this (and I think it should be
fixed).  Does anyone agree with me?
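
To make the difference concrete, here is a minimal sketch in plain Perl
that models the two evaluation orders.  It does not touch utf8_heavy.pl or
real swashes: %class, apply_lines() and longest_prefix() are hypothetical
helpers made up for this illustration, and the two classes are restricted
to ASCII.

```perl
use strict;
use warnings;

# Hypothetical ASCII-only stand-ins for utf8::IsAlnum and utf8::IsDigit.
my %class = (
    IsAlnum => sub { $_[0] =~ /\A[A-Za-z0-9]\z/ },
    IsDigit => sub { $_[0] =~ /\A[0-9]\z/ },
);

# Apply description lines to a set of characters, strictly in the order
# given.  "+utf8::X" adds a class, "-utf8::X" removes it, and
# "HEX\tHEX" adds an inclusive code point range.
sub apply_lines {
    my ($set, @lines) = @_;
    for my $line (@lines) {
        if ($line =~ /\A([+-])utf8::(\w+)\z/) {
            my ($op, $test) = ($1, $class{$2});
            for my $c (map { chr($_) } 0 .. 127) {
                next unless $test->($c);
                if ($op eq '+') { $set->{$c} = 1 }
                else            { delete $set->{$c} }
            }
        }
        elsif ($line =~ /\A([0-9A-Fa-f]+)\t([0-9A-Fa-f]+)\z/) {
            $set->{ chr($_) } = 1 for hex($1) .. hex($2);
        }
    }
    return $set;
}

# Longest leading run of $str whose characters are all in the set.
sub longest_prefix {
    my ($set, $str) = @_;
    my $out = '';
    for my $c (split //, $str) {
        last unless $set->{$c};
        $out .= $c;
    }
    return $out;
}

my @lines = ("+utf8::IsAlnum", "-utf8::IsDigit", "0031\t0033");

# Order as written: the range line is applied last.
my $in_order = apply_lines({}, @lines);

# Order described above: range lines first, then the +/- lines.
my @ranges = grep { !/\A[+-]/ } @lines;
my @named  = grep {  /\A[+-]/ } @lines;
my $ranges_first = apply_lines({}, @ranges, @named);

print longest_prefix($in_order,     "abc123def456ghi"), "\n";  # abc123def
print longest_prefix($ranges_first, "abc123def456ghi"), "\n";  # abc
```

With the lines applied as written, 1 through 3 survive; with the range
applied first, the later -utf8::IsDigit wipes them out.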

-- 
Jeff "japhy" Pinyan      japhy@pobox.com      http://www.pobox.com/~japhy/
RPI Acacia brother #734   http://www.perlmonks.org/   http://www.cpan.org/
<stu> what does y/// stand for?  <tenderpuss> why, yansliterate of course.
[  I'm looking for programming work.  If you like my work, let me know.  ]