
Matching multi-character folds

karl williamson
November 22, 2008 19:59
This email is best viewed under utf8.

The Unicode standard lists several different cases where a character (or
code point if you prefer) should match a multi-character sequence 
when case is ignored.

One of these, oft mentioned on this list, is the German lower case sharp
s, ß (U+00DF):  'ss' =~ /ß/i is true.

And perl does currently work that way if and only if the ß is stored in 
utf8.  For the purposes of this email, I'm assuming all strings are in utf8.

In a recent email, Yves said that he thinks it is debatable whether or 
not they should match.  My own view is that they should.  What I think 
is beyond debate is that the utf8ness of the strings should not matter. 
To quote from the perltodo: "The handling of Unicode is unclean in 
many places. For example, the regexp engine matches in Unicode semantics 
whenever the string or the pattern is flagged as UTF-8, but that should 
not be dependent on an internal storage detail of the string. Likewise, 
case folding behaviour is dependent on the UTF8 internal flag being on 
or off."

Yves has submitted an RFC for the first part of that statement, and I'm 
now going to talk about the second.  I believe we have established that 
there will be a new mode of operation, to become the default in 5.12, in 
which characters in the 128-255 range case-fold match as the Unicode 
standard says.  But there are some general issues with multi-char folds 
(the only one in that range being ß).

To start the discussion about the multi-char folds, I give examples of 
the various types defined in the standard.  The first type is that of ß.

Another type is ligatures (Unicode doesn't view ß as a ligature, and I
don't know why).  So 'fi' =~ /ﬁ/i is true (U+FB01).

Another type is where there is no single precomposed upper or title
case character corresponding to a lower case one.  For instance LATIN 
SMALL LETTER J WITH CARON (U+01F0), so 'ǰ' =~ /ǰ/i is true. 

Still another type is lower case Greek letters with an iota subscript or
an iota adscript.  I won't put in an example.

And the final types all have to do with putting a combining dot above i
and j in Azeri, Turkish, and Lithuanian locales, which perl doesn't 
support in Unicode.
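
For anyone who wants to see the first three kinds of fold directly, here 
is a minimal sketch.  It assumes perl 5.16 or later (which postdates 
this message) for the fc() builtin; "use v5.16" also turns on 
unicode_strings, so the result does not depend on how the string happens 
to be stored.  The expected fold sequences are the ones listed in 
Unicode's CaseFolding.txt.

```perl
use v5.16;      # assumption: perl 5.16+; enables fc and unicode_strings
use warnings;

# Code point => the multi-character sequence it folds to
my %folds = (
    "\x{DF}"   => "ss",          # LATIN SMALL LETTER SHARP S
    "\x{FB01}" => "fi",          # LATIN SMALL LIGATURE FI
    "\x{1F0}"  => "j\x{30C}",    # LATIN SMALL LETTER J WITH CARON
);

for my $char (sort keys %folds) {
    my $folded = fc($char);      # full (not simple) case folding
    printf "U+%04X folds to %d characters: %s\n",
        ord($char), length($folded),
        $folded eq $folds{$char} ? "as listed" : "NOT as listed";
}
```

Each fold is longer than one character, which is exactly what makes 
these awkward for anything that expects to consume a single character.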

I think it is more correct for these things to match than not. 
However, I'm not so sure when things are put in a character class.  What 
should /[ß]/i match?  I'm tempted to say not 'ss' because character 
classes match only a single character.  But with the J with caron, that 
really is like a single character, with the caron just a modifier.  For 
that I'm tempted to say yes 'ǰ' =~ /[ǰ]/i.  The problem is that the 
concept of a character class doesn't fit with the Unicode ideas.  I 
haven't done any research as to what other languages, etc. do.

Would you like to know what happens today in perl?  Well I'll tell you 
anyway.  'ss' =~ /[ß]/i is true and 'ǰ' =~ /[ǰ]/i is false.  In fact, 
every other multi-char fold returns false when it is put in a character 
class under /i.  This may be the only time in perl history (savor the 
moment) when the infamous ß gives an arguably more correct result than 
other characters.

The code in regcomp.c takes special pains to make all these match.  But 
it doesn't work, except in the [ß] case.  So we don't have to worry 
about breaking existing code if we decide it should work differently.
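
Since the answers depend on which perl you run, a small probe is safer 
than a table of results.  This is only a sketch: the probe() helper is 
made up for illustration, the /u modifier assumes perl 5.14 or later 
(it asks for Unicode rules whatever the internal storage), and the J 
with caron case assumes the left-hand side is j followed by U+030C 
COMBINING CARON, i.e. the fold of U+01F0.

```perl
use strict;
use warnings;

# Hypothetical helper: report a match the way the text words it.
sub probe {
    my ($string, $pattern) = @_;
    return $string =~ $pattern ? 'true' : 'false';
}

# Each entry: [ label, string, compiled pattern ]
my @cases = (
    [ q{'ss' =~ /[\x{DF}]/i},        "ss",       qr/[\x{DF}]/iu  ],
    [ q{"j\x{30C}" =~ /[\x{1F0}]/i}, "j\x{30C}", qr/[\x{1F0}]/iu ],
);

printf "%-30s %s\n", $_->[0], probe($_->[1], $_->[2]) for @cases;
```

No expected output is given on purpose; the point is to find out what 
your own perl says.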

Let's look at it in the other direction.  Should 'ß' =~ /ss/i ?  Should 
'ǰ' =~ /ǰ/i ?  Both are true currently.  However, things like 'ß' =~ 
/s{2}/i are false, and that seems inconsistent.
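
The inconsistency is easy to check for yourself.  A minimal sketch, 
again assuming perl 5.14 or later for /u; the two patterns describe the 
same thing (two s's in a row), so one would hope they agree:

```perl
use strict;
use warnings;

my $sharp_s = "\x{DF}";    # LATIN SMALL LETTER SHARP S

# Two spellings of "s twice"; /u (assumption: perl 5.14+) forces Unicode rules.
my %patterns = (
    '/ss/i'   => qr/ss/iu,
    '/s{2}/i' => qr/s{2}/iu,
);

# Record and print whether the single character matches each spelling.
my %result;
for my $label (sort keys %patterns) {
    $result{$label} = $sharp_s =~ $patterns{$label} ? 'true' : 'false';
    printf "sharp s =~ %-8s is %s\n", $label, $result{$label};
}
```

If the two lines differ, the quantified form is not being treated as 
the literal sequence that the fold produces.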

So, I'm not sure what the right answers are, but things are somewhat 
broken today, and I'd like to get clarity on how it should work.
