More on character folding
From: karl williamson
Date: November 28, 2008 12:11
Subject: More on character folding
Message ID: 49305073.4000907@khwilliamson.com
Here are more details on one class of problem with character fold
matching in Unicode. It is most simply illustrated with U+017F, the
LATIN SMALL LETTER LONG S. It is an alternate form of 's', looking
kind of like an 'f'. Things like the USA Declaration of Independence
were written using this symbol. According to the notes from Unicode, it
is still in current use in Gaelic and Fraktur. Anyway, according to the
Unicode standard it should loosely match 's'; Perl equates this type of
matching to /i matching. And it mostly does.
print __LINE__, " ", ("s" =~ /\x{017F}/i ? "yes" : "no"), "\n";
print __LINE__, " ", ("S" =~ /\x{017F}/i ? "yes" : "no"), "\n";
print __LINE__, " ", ("s" =~ /\x{017F}+/i ? "yes" : "no"), "\n";
print __LINE__, " ", ("s" =~ /\x{017F}{1}/i ? "yes" : "no"), "\n";
print __LINE__, " ", ("\x{017F}" =~ /s/i ? "yes" : "no"), "\n";
print __LINE__, " ", ("\x{017F}" =~ /S/i ? "yes" : "no"), "\n";
all print yes. But,
print __LINE__, " ", ("\x{017F}" =~ /s+/i ? "yes" : "no"), "\n";
print __LINE__, " ", ("\x{017F}" =~ /s{1}/i ? "yes" : "no"), "\n";
both print no. I think that there are 166 code points in Unicode 5.1
which may have problems like this when they appear in a pattern. A file
listing these is attached.
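To check other code points the same way, the three pattern shapes used above can be wrapped in a small helper. This is only a sketch: fold_variants is a hypothetical name, not something from perl itself, and the candidate list holds just U+017F because the contents of the attached file are not reproduced here. The results depend on the perl version; a perl without the problem should report yes for all three.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of perl): match chr($cp) case-insensitively
# against three spellings of a pattern for $target and record each result.
sub fold_variants {
    my ($cp, $target) = @_;
    my $char = chr $cp;
    my $t    = quotemeta $target;
    my %result;
    for my $variant ([plain => $t], [plus => $t . '+'], [counted => $t . '{1}']) {
        my ($name, $pattern) = @$variant;
        $result{$name} = $char =~ /$pattern/i ? 'yes' : 'no';
    }
    return \%result;
}

# Assumption: only U+017F is listed; extend with [code point, fold target] pairs.
my @candidates = ([0x017F, 's']);

for my $c (@candidates) {
    my ($cp, $target) = @$c;
    my $r = fold_variants($cp, $target);
    printf "U+%04X vs %s: plain=%s plus=%s counted=%s\n",
        $cp, $target, @{$r}{qw(plain plus counted)};
}
```

Any row where plain is yes but plus or counted is no shows the same asymmetry as the two failing lines above.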
Looking at the re debugging output for these, and with my very limited
knowledge, it appears that the {1} failure could be a result of the
optimizer: it doesn't seem to know that 017F represents one character,
and thinks it has length 2. For the one with a +, I'm not sure of the
cause. If the optimizer is not to blame, one idea, which I'm just
throwing out with no expertise to back it up, is to create a list at
perl compile time of the code points that cause errors, and either use
a special node for them or add a flag to the EXACTF node.
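For anyone who wants to look at the same debugging output, here is a minimal sketch using the re pragma. The exact dump format varies between perl versions, so treat the output as something to read rather than to parse. The minimum-length figure in the summary may be where the length of 2 shows up, but that is an assumption on my part.

```perl
use strict;
use warnings;

my $long_s = "\x{017F}";    # LATIN SMALL LETTER LONG S
my $matched;

{
    # 'use re "debug"' is lexically scoped, so only the pattern compiled
    # inside this block is dumped (the dump goes to STDERR).
    use re 'debug';
    $matched = $long_s =~ /s{1}/i ? 1 : 0;
}

print "s{1} against U+017F: ", ($matched ? "yes" : "no"), "\n";
```

Swapping the pattern for /s+/i or plain /s/i and comparing the dumps is the quickest way to see which nodes the compiler chose in each case.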