Matching multi-character folds
From: karl williamson
Date: November 22, 2008 19:59
Subject: Matching multi-character folds
Message ID: 4928D4EE.3030907@khwilliamson.com
This email is best viewed under utf8.
The Unicode standard lists several different cases where a character (or
code point if you prefer) should match a multiple character sequence
when case is ignored.
One of these is the one oft mentioned on this list: the German lower case
sharp s, ß (U+00DF). 'ss' =~ /ß/i is true.
And perl does currently work that way if and only if the ß is stored in
utf8. For the purposes of this email, I'm assuming all strings are in utf8.
In a recent email, Yves has said that he thinks it is debatable whether
or not it should work this way. My own view is that they should match.
It is beyond debate that the utf8ness of the strings should not matter.
To quote from the perltodo: "The handling of Unicode is unclean in
many places. For example, the regexp engine matches in Unicode semantics
whenever the string or the pattern is flagged as UTF-8, but that should
not be dependent on an internal storage detail of the string. Likewise,
case folding behaviour is dependent on the UTF8 internal flag being on
or off."
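The storage dependence described above can be observed with a small sketch. It builds the same one-character string ß twice, once in native (non-utf8) storage and once upgraded to utf8, and tries each as a case-insensitive pattern against 'ss'. The helper names sharp_s_pair and matches_ss are made up for illustration, and the match results are deliberately not stated here, since they vary with the perl version and the features in effect.

```perl
use strict;
use warnings;

# Hypothetical helper: return U+00DF twice, in native and in utf8 storage.
sub sharp_s_pair {
    my $native   = "\xDF";     # stored as a single byte, utf8 flag off
    my $upgraded = "\xDF";
    utf8::upgrade($upgraded);  # same character, utf8 flag on
    return ($native, $upgraded);
}

# Hypothetical helper: does 'ss' match the given pattern text under /i?
sub matches_ss {
    my ($pattern) = @_;
    return 'ss' =~ /$pattern/i ? 1 : 0;
}

my ($native, $upgraded) = sharp_s_pair();
print "strings compare equal: ", ($native eq $upgraded ? "yes" : "no"), "\n";
print "native pattern matches 'ss': ",   matches_ss($native),   "\n";
print "upgraded pattern matches 'ss': ", matches_ss($upgraded), "\n";
```

The two strings always compare equal, so any difference between the last two lines comes from the internal storage alone.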
Yves has submitted an RFC for the first part of that statement, and I'm
now going to talk about the second. I believe we have established that
there will be a new mode of operation, to become the default in 5.12, in
which characters in the 128-255 range case-fold match as the Unicode
standard says. But there are some general issues with multi-char folds
(the only one in that range being ß).
To start the discussion about the multi-char folds, I give examples of
the various types defined in the standard. The first type is that of ß.
Another type is ligatures (they don't view ß as a ligature, and I don't
know why). So 'fi' =~ /ﬁ/i is true. (U+FB01)
Another type is where there is no single precomposed upper or title
case character corresponding to a lower case one. For instance LATIN
SMALL LETTER J WITH CARON (U+01F0), so 'ǰ' =~ /ǰ/i is true, where the
string is j followed by U+030C COMBINING CARON.
Still another type is lower case Greek letters with an iota subscript or
an iota adscript. I won't put in an example.
And the final types all have to do with putting a combining dot above i
and j in Azeri, Turkish, and Lithuanian locales, which perl doesn't
support in Unicode.
I think it is more correct for these things to match than not.
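For reference, the folds behind the examples above can be listed with a short sketch. It assumes perl 5.16 or later, because it relies on the fc() builtin, which is not something the perls discussed in this email offer. The helper name fold_of is made up for illustration, and U+1FB3 (GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI) is my pick for the iota subscript type, not one named above.

```perl
use v5.16;      # assumption: perl 5.16+, which enables fc() and unicode_strings
use warnings;

# Hypothetical helper: full case fold of one code point,
# returned as a space-separated list of hex code points.
sub fold_of {
    my ($cp) = @_;
    return join ' ', map { sprintf('%04X', ord($_)) } split //, fc(chr($cp));
}

# sharp s, fi ligature, j with caron, alpha with iota subscript
for my $cp (0xDF, 0xFB01, 0x1F0, 0x1FB3) {
    printf "U+%04X folds to %s\n", $cp, fold_of($cp);
}
```

Each of these folds to more than one code point, which is what makes them multi-char folds; ß, for instance, comes out as 0073 0073.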
However, I'm not so sure when things are put in a character class. What
should /[ß]/i match? I'm tempted to say not 'ss' because character
classes match only a single character. But with the J with caron, that
really is like a single character, with the caron just a modifier. For
that I'm tempted to say yes, 'ǰ' =~ /[ǰ]/i should match. The problem is that the
concept of a character class doesn't fit with the Unicode ideas. I
haven't done any research as to what other languages, etc do.
Would you like to know what happens today in perl? Well I'll tell you
anyway. 'ss' =~ /[ß]/i is true and 'ǰ' =~ /[ǰ]/i is false. In fact,
every other multi-char fold in a character class fails to match under
/i. This in fact may be the only time in perl history, savor the moment,
when the infamous ß gives an arguably more correct result than other
characters.
The code in regcomp.c takes special pains to make all these match. But
it doesn't work, except in the [ß] case. So we don't have to worry
about breaking existing code if we decide it should work differently.
Let's look at it the other direction, with the single character in the
string and the sequence in the pattern. Should 'ß' =~ /ss/i ? Should 'ǰ'
=~ /ǰ/i ? They both are true currently. However, something like 'ß' =~
/s{2}/i is false, and that seems inconsistent.
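Anyone wanting to check the character class and reverse-direction cases on their own perl could use a sketch along these lines. Both sides are upgraded to utf8, matching the assumption made at the top of this email. The helper name folds_match and the ASCII labels are made up for illustration, and no results are claimed, since they differ between perl versions.

```perl
use strict;
use warnings;

# Hypothetical helper: upgrade both sides to utf8, then report whether
# the string matches the pattern text case-insensitively.
sub folds_match {
    my ($string, $pattern) = @_;
    utf8::upgrade($string);
    utf8::upgrade($pattern);
    return $string =~ /$pattern/i ? 1 : 0;
}

my @cases = (
    # label (ASCII only)               string        pattern text
    [ q('ss' =~ /[U+00DF]/i),          'ss',         "[\x{DF}]"   ],
    [ q('j U+030C' =~ /[U+01F0]/i),    "j\x{30C}",   "[\x{1F0}]"  ],
    [ q('fi' =~ /[U+FB01]/i),          'fi',         "[\x{FB01}]" ],
    [ q('U+00DF' =~ /ss/i),            "\x{DF}",     'ss'         ],
    [ q('U+01F0' =~ /j U+030C/i),      "\x{1F0}",    "j\x{30C}"   ],
    [ q('U+00DF' =~ /s{2}/i),          "\x{DF}",     's{2}'       ],
);

for my $case (@cases) {
    my ($label, $string, $pattern) = @$case;
    printf "%-30s %s\n", $label,
        folds_match($string, $pattern) ? 'true' : 'false';
}
```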
So, I'm not sure what the right answers are, but things are somewhat
broken today, and I'd like to get clarity on how it should work.