perl.perl5.porters | Postings from November 2008

More on character folding

karl williamson
November 28, 2008 12:11
Here are more details on one class of problem with character fold
matching in Unicode.  It is most simply illustrated with U+017F, the
LATIN SMALL LETTER LONG S.  It is an alternate form of 's', looking
kind of like an 'f'.  Things like the USA Declaration of Independence
were written using this symbol.  According to the notes from Unicode, it
is still in current use in Gaelic and Fraktur.  Anyway, according to the
Unicode standard it should loosely match 's'; Perl equates this type of
matching to /i matching.  And it mostly does.

print __LINE__, " ", ("s" =~ /\x{017F}/i ? "yes" : "no"), "\n";
print __LINE__, " ", ("S" =~ /\x{017F}/i ? "yes" : "no"), "\n";
print __LINE__, " ", ("s" =~ /\x{017F}+/i ? "yes" : "no"), "\n";
print __LINE__, " ", ("s" =~ /\x{017F}{1}/i ? "yes" : "no"), "\n";
print __LINE__, " ", ("\x{017F}" =~ /s/i ? "yes" : "no"), "\n";
print __LINE__, " ", ("\x{017F}" =~ /S/i ? "yes" : "no"), "\n";

all print yes.  But,

print __LINE__, " ", ("\x{017F}" =~ /s+/i ? "yes" : "no"), "\n";
print __LINE__, " ", ("\x{017F}" =~ /s{1}/i ? "yes" : "no"), "\n";

both print no.  I think that there are 166 code points in Unicode 5.1 
which may have problems like this when they appear in a pattern.  A file 
listing these is attached.
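
For anyone wanting to re-run these cases on their own perl, the checks
above can be folded into a small table-driven script.  This is only a
sketch: the helper name fold_report is made up for illustration and is
not part of the original test file, and the answers it prints depend on
the perl version it runs under.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): match one subject
# string against several pattern sources under /i and record yes/no.
sub fold_report {
    my ($subject, @pattern_sources) = @_;
    my @rows;
    for my $src (@pattern_sources) {
        my $re = qr/$src/i;    # compile the pattern source case-insensitively
        push @rows, [ $src, ($subject =~ $re ? "yes" : "no") ];
    }
    return @rows;
}

# U+017F as the subject, 's' in the pattern with and without quantifiers.
my @rows = fold_report("\x{017F}", 's', 'S', 's+', 's{1}');
printf "%-6s %s\n", @$_ for @rows;
```

Keeping the subject fixed and varying only the quantifier makes it easy
to see which pattern shapes lose the fold.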

Looking at the re debugging output for these, and with my very limited
knowledge, it appears that the {1} failure could be a result of the
optimizer: it doesn't seem to know that the 017F represents one
character, and thinks it is length 2.  For the one with a +, I'm not
sure of the cause.  If the optimizer is not at fault, one idea, which
I'm just throwing out with no expertise to back it up, is to create a
list at perl compile time of the code points that cause errors and
either use a special node for them, or add a flag to the EXACTF node.
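
One possible reading of the "length 2" above, which is only an
assumption and not something the debugging output confirms, is that
the 2 is the UTF-8 byte length of U+017F rather than its character
count.  A quick way to see both numbers, using the core Encode module:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $long_s = "\x{017F}";                        # LATIN SMALL LETTER LONG S
my $chars  = length $long_s;                    # length in characters
my $bytes  = length encode('UTF-8', $long_s);   # length once encoded as UTF-8

print "characters: $chars, UTF-8 bytes: $bytes\n";   # characters: 1, UTF-8 bytes: 2
```

By contrast 's' is one character and one byte, so a length computed in
bytes for one side and in characters for the other would disagree.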