perl.perl5.porters | Postings from November 2008

Matching multi-character folds

karl williamson
November 22, 2008 19:46
This email is best viewed under utf8.

The Unicode standard lists several different cases where a character (or
code point if you prefer) should match a multi-character sequence 
when case is ignored.

One of these is the one often mentioned on this list: the German lower
case sharp s, ß (U+00DF).  'ss' =~ /ß/i is true.

And perl does currently work that way if and only if the ß is stored in 
utf8.  For the purposes of this email, I'm assuming all strings are in utf8.
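
One way to see the dependence on storage is to force the same pattern
into each internal form and compare.  This is only a sketch: imatch is a
helper name made up for this illustration, and what the byte-stored line
prints depends on the perl version, so no output is shown.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $text matches $pattern ignoring case, else 0.
sub imatch {
    my ($text, $pattern) = @_;
    return $text =~ /$pattern/i ? 1 : 0;
}

my $bytes = "\x{DF}";     # sharp s, stored as a single byte
my $wide  = "\x{DF}";
utf8::upgrade($wide);     # the same character, stored internally as utf8

# The two strings are equal as far as perl code can tell ...
print "strings equal: ", ($bytes eq $wide ? "yes" : "no"), "\n";

# ... but they may behave differently when used as patterns.
printf "byte-stored pattern: %s\n", imatch('ss', $bytes) ? 'match' : 'no match';
printf "utf8-stored pattern: %s\n", imatch('ss', $wide)  ? 'match' : 'no match';
```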

In a recent email, Yves has said that he thinks it is debatable whether 
or not it should work this way.  My own view is that they should match, 
and it is beyond debate that the utf8ness of the strings should not 
matter.  To quote from the perltodo: "The handling of Unicode is unclean 
in many places. For example, the regexp engine matches in Unicode 
semantics whenever the string or the pattern is flagged as UTF-8, but 
that should not be dependent on an internal storage detail of the 
string. Likewise, case folding behaviour is dependent on the UTF8 
internal flag being on or off."

To start the discussion about the multi-char folds, I give examples of 
the various types defined in the standard.  The first case is that of ß.

Another case is ligatures (they don't view ß as a ligature, and I don't
know why).  So 'fi' =~ /ﬁ/i is true. (U+FB01)

Another case is where there is no corresponding upper or title
case single precomposed character to a lower case one.  For instance
LATIN SMALL LETTER J WITH CARON (U+01F0), so 'ǰ' (a 'j' followed by
U+030C COMBINING CARON) =~ /ǰ/i is true.

Still another case is lower case Greek letters with an iota subscript or
an iota adscript.  I won't put in an example.

And the final cases all have to do with putting a combining dot above i
and j in Azeri, Turkish, and Lithuanian locales, which perl doesn't 
support in Unicode.
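
The folds for the code points named above can be listed directly.  This
is a sketch that assumes a perl with the fc and unicode_strings features
(5.16 or later, which is newer than anything available as I write this);
fold_of is a name invented for the illustration.

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);   # assumption: perl 5.16 or later

# Hypothetical helper: the full case fold of one code point, returned as
# a list of "U+XXXX" strings, one per code point in the folded sequence.
sub fold_of {
    my ($cp) = @_;
    return map { sprintf 'U+%04X', ord($_) } split //, fc(chr $cp);
}

# Sharp s, the fi ligature, and j with caron.
for my $cp (0xDF, 0xFB01, 0x1F0) {
    printf "U+%04X folds to %s\n", $cp, join ' ', fold_of($cp);
}
```

Each of the three should fold to a two code point sequence: s s, f i, and
j followed by the combining caron.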

I think it is more correct for these things to match than not. 
However, I'm not so sure when things are put in a character class.  What 
should /[ß]/i match?  I'm tempted to say not 'ss' because character 
classes match only a single character.  But with the J with caron, that 
really is like a single character, with the caron really just a 
modifier.  For that I'm tempted to say yes 'ǰ' =~ /[ǰ]/i.  The problem 
is that the concept of a character class doesn't fit with the Unicode 
ideas.  I haven't done any research as to what other languages, etc. do.

Would you like to know what happens today in perl?  Well I'll tell you 
anyway.  'ss' =~ /[ß]/i is true and 'ǰ' =~ /[ǰ]/i is false.  In fact, every 
other multi-char fold returns false.  This in fact may be the only time 
in perl history, savor the moment, when the infamous ß gives an arguably 
more correct result than other characters.

Now the code in regcomp.c takes special pains to make all these match. 
But it doesn't work, except in the [ß] case.  So we don't have to worry 
about breaking existing code if we decide it should work differently.

Let's look at it from the other direction.  Should 'ß' =~ /ss/i ?  Should
'ǰ' (U+01F0) =~ /ǰ/i (a 'j' followed by U+030C) ?  They both are true
currently.  However, something like 'ß' =~ /s{2}/i is false, and that
seems inconsistent.
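
For anyone who wants to try these on their own perl, the cases from this
message can be collected into one script.  It is a sketch only: run_case
is a made-up helper, every string is upgraded to utf8 to match the
assumption stated earlier, and I give no expected output because the
character class and quantifier results are exactly what is in question.

```perl
use strict;
use warnings;

# Hypothetical helper: upgrade both sides to utf8, then match ignoring case.
sub run_case {
    my ($text, $pattern) = @_;
    utf8::upgrade($text);
    utf8::upgrade($pattern);
    return $text =~ /$pattern/i ? 'match' : 'no match';
}

# Each entry is [ label, text, pattern source ].
my @cases = (
    [ 'ss       =~ /[U+00DF]/i',   'ss',       "[\x{DF}]"  ],
    [ 'j U+030C =~ /[U+01F0]/i',   "j\x{30C}", "[\x{1F0}]" ],
    [ 'U+00DF   =~ /ss/i',         "\x{DF}",   'ss'        ],
    [ 'U+01F0   =~ /j U+030C/i',   "\x{1F0}",  "j\x{30C}"  ],
    [ 'U+00DF   =~ /s{2}/i',       "\x{DF}",   's{2}'      ],
);

printf "%-26s %s\n", $_->[0], run_case($_->[1], $_->[2]) for @cases;
```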

So, I'm not sure what the right answers are, but things are broken today.
