Re: Matching multi-character folds
From: demerphq
Date: November 23, 2008 09:01
Subject: Re: Matching multi-character folds
Message ID: 9b18b3110811230901q15e3f9fdle3e1bb24178c177a@mail.gmail.com
2008/11/23 karl williamson <public@khwilliamson.com>:
> This email is best viewed under utf8.
>
> The Unicode standard lists several different cases where a character (or
> code point if you prefer) should match a multiple character sequence when
> case is ignored.
>
> One of these is the oft mentioned in this list, German lower case sharp
> s or ß. 'ss' =~ /ß/i is true. (U+00DF)
0xDF is the only multi-codepoint folding character in the Latin-1 range.
Also, 0xDF is a "trickyfold" character, meaning that it can match
something that is longer (in terms of bytes) folded than unfolded.
> And perl does currently work that way if and only if the ß is stored in
> utf8. For the purposes of this email, I'm assuming all strings are in utf8.
> In a recent email, Yves has said that he thinks it is debatable whether or
> not it should work this way. My own view is that they should match, and it
> is beyond debate that the utf8ness of the strings should matter or not. To
> quote from the perltodo: "The handling of Unicode is unclean in many places.
> For example, the regexp engine matches in Unicode semantics whenever the
> string or the pattern is flagged as UTF-8, but that should not be dependent
> on an internal storage detail of the string. Likewise, case folding
> behaviour is dependent on the UTF8 internal flag being on or off."
What do you mean by "beyond debate" here?
It seems to me that there is a debate about whether unencoded,
non-localized strings should be treated as ASCII or as Latin-1 and, if
treated as Latin-1, whether they should obey Unicode case-folding rules
or not.
>
> To start the discussion about the multi-char folds, I give examples of the
> various types defined in the standard. The first case is that of ß.
>
> Another case is ligatures (they don't view ß as a ligature, and I don't
> know why) So 'fi' =~ /ﬁ/i is true. (U+FB01)
Prompted by your comment about 'ß' I did some searching for
information on ligatures and Unicode, and I was surprised how little
there was. The only ligature support seems to be for legacy conversion
reasons (for instance Latin-1 equivalency), and it seems that
ligatures are considered to be a presentation issue better left up to
the font and the font rendering engine. A good discussion is here:
http://unicode.org/faq/ligature_digraph.html
When I checked the Unicode data files I didn't find anything about
ligatures beyond certain character names including the word
'LIGATURE', and some comments and commentary files mentioning that
some characters are ligatures. So I'm wondering what you were getting
at when you said "they don't view ß as a ligature, and I don't know
why".
> Another case is where there is no corresponding upper or title
> case single precomposed character to a lower case one. For instance
> LATIN SMALL LETTER J WITH CARON, so 'ǰ' =~ /ǰ/i is true. (U+01F0)
>
> Still another case is lower Greek letters with a iota-subscript or a
> iota adscript. I won't put in an example.
>
> And the final cases all have to do with putting a combining dot above i
> and j in Azeri, Turkish, and Lithuanian locales, which perl doesn't support
> in Unicode.
>
> I think it is more correct for these things to match than not. However, I'm
> not so sure when things are put in a character class. What should /[ß]/i
> match? I'm tempted to say not 'ss' because character classes match only a
> single character. But with the J with caron, that really is like a single
> character, with the caron really just a modifier. For that I'm tempted to
> say yes 'ǰ' =~ /[ǰ]/i. The problem is that the concept of a character
> class doesn't fit with the Unicode ideas. I haven't done any research as to
> what other languages, etc do.
>
> Would you like to know what happens today in perl? Well I'll tell you
> anyway. /[ß]/i is true and 'ǰ' =~ /[ǰ]/i is false. In fact, every other
I can't reproduce that. In bleadperl 'ǰ' =~ /[ǰ]/i matches fine as far as
I can tell.
What doesn't work is
fold('ǰ') =~ /[ǰ]/i
where fold('ǰ') is equivalent to "\x{6A}\x{30C}".
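As an aside, fold() above is only shorthand. On a perl new enough to have the fc() builtin (5.16 or later, so none of the perls discussed in this thread), the folded forms can be inspected with something like the sketch below. The helper name fold_hex is made up for illustration.

```perl
use v5.16;    # assumption: a perl with the fc() builtin (full case folding)

# Return the full case fold of a string as space-separated hex code points.
# fold_hex is a hypothetical helper, not part of perl or of any script here.
sub fold_hex {
    my ($str) = @_;
    return join ' ', map { sprintf '%04X', ord($_) } split //, fc($str);
}

# The three examples from this thread: sharp s, the fi ligature, j with caron.
for my $cp (0xDF, 0xFB01, 0x1F0) {
    printf "U+%04X folds to %s\n", $cp, fold_hex(chr($cp));
}
```

For U+01F0 this reports the two code points 006A 030C, which is the "\x{6A}\x{30C}" string mentioned above.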
> multi-char fold returns false. This in fact may be the only time in perl
> history, savor the moment, when the infamous ß gives an arguably more
> correct result than other characters.
Hmm. Interesting. I can't decide whether to be happy about this or sad.
>
> Now the code in regcomp.c takes special pains to make all these match. But
> it doesn't work, except in the [ß] case. So we don't have to worry about
> breaking existing code if we decide it should work differently.
>
> Let's look at it the other direction. Should ß =~ /ss/i ? Should 'ǰ' =~
> /ǰ/i ? They both are true currently. However, things like ß =~ /s{2}/i is
> false, and that seems inconsistent.
>
>
> So, I'm not sure what the right answers are, but things are broken today.
>
Yes, things are. I wrote the attached hacky script to parse
CaseFolding.txt and test all the complex folding rules. The output is
below. In the codes 'll', 'lu', 'ul' and 'uu', 'l' means 'latin' and
'u' means 'unicode'; the first letter gives the encoding of the string
and the second the encoding of the pattern. The description on the
right is the test, with chars shown in hex and, in the case of the
folded string, separated by spaces. The output on 5.8.9 looks
different, with more mistakes.
demerphq@gemini:~/blead/p4/lib/unicore$ ../../perl -I../../lib test_case_folding.pl
LATIN SMALL LETTER SHARP S
ll '0073 0073' =~ /00DF/i
ll, ul, uu '0073 0073' =~ /[00DF]/i
LATIN CAPITAL LETTER I WITH DOT ABOVE
uu '0069 0307' =~ /[0130]/i
LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
uu '02BC 006E' =~ /[0149]/i
LATIN SMALL LETTER J WITH CARON
uu '006A 030C' =~ /[01F0]/i
GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
uu '03B9 0308 0301' =~ /[0390]/i
GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
uu '03C5 0308 0301' =~ /[03B0]/i
ARMENIAN SMALL LIGATURE ECH YIWN
uu '0565 0582' =~ /[0587]/i
LATIN SMALL LETTER H WITH LINE BELOW
uu '0068 0331' =~ /[1E96]/i
LATIN SMALL LETTER T WITH DIAERESIS
uu '0074 0308' =~ /[1E97]/i
LATIN SMALL LETTER W WITH RING ABOVE
uu '0077 030A' =~ /[1E98]/i
LATIN SMALL LETTER Y WITH RING ABOVE
uu '0079 030A' =~ /[1E99]/i
LATIN SMALL LETTER A WITH RIGHT HALF RING
uu '0061 02BE' =~ /[1E9A]/i
LATIN CAPITAL LETTER SHARP S
lu, uu '0073 0073' =~ /[1E9E]/i
GREEK SMALL LETTER UPSILON WITH PSILI
uu '03C5 0313' =~ /[1F50]/i
GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA
uu '03C5 0313 0300' =~ /[1F52]/i
GREEK SMALL LETTER UPSILON WITH PSILI AND OXIA
uu '03C5 0313 0301' =~ /[1F54]/i
GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI
uu '03C5 0313 0342' =~ /[1F56]/i
GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
uu '1F00 03B9' =~ /[1F80]/i
GREEK SMALL LETTER ALPHA WITH DASIA AND YPOGEGRAMMENI
uu '1F01 03B9' =~ /[1F81]/i
GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI
uu '1F02 03B9' =~ /[1F82]/i
GREEK SMALL LETTER ALPHA WITH DASIA AND VARIA AND YPOGEGRAMMENI
uu '1F03 03B9' =~ /[1F83]/i
GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND YPOGEGRAMMENI
uu '1F04 03B9' =~ /[1F84]/i
GREEK SMALL LETTER ALPHA WITH DASIA AND OXIA AND YPOGEGRAMMENI
uu '1F05 03B9' =~ /[1F85]/i
GREEK SMALL LETTER ALPHA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
uu '1F06 03B9' =~ /[1F86]/i
GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
uu '1F07 03B9' =~ /[1F87]/i
GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
uu '1F00 03B9' =~ /[1F88]/i
GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI
uu '1F01 03B9' =~ /[1F89]/i
GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI
uu '1F02 03B9' =~ /[1F8A]/i
GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI
uu '1F03 03B9' =~ /[1F8B]/i
GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI
uu '1F04 03B9' =~ /[1F8C]/i
GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI
uu '1F05 03B9' =~ /[1F8D]/i
GREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
uu '1F06 03B9' =~ /[1F8E]/i
GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
uu '1F07 03B9' =~ /[1F8F]/i
GREEK SMALL LETTER ETA WITH PSILI AND YPOGEGRAMMENI
uu '1F20 03B9' =~ /[1F90]/i
GREEK SMALL LETTER ETA WITH DASIA AND YPOGEGRAMMENI
uu '1F21 03B9' =~ /[1F91]/i
GREEK SMALL LETTER ETA WITH PSILI AND VARIA AND YPOGEGRAMMENI
uu '1F22 03B9' =~ /[1F92]/i
GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND YPOGEGRAMMENI
uu '1F23 03B9' =~ /[1F93]/i
GREEK SMALL LETTER ETA WITH PSILI AND OXIA AND YPOGEGRAMMENI
uu '1F24 03B9' =~ /[1F94]/i
GREEK SMALL LETTER ETA WITH DASIA AND OXIA AND YPOGEGRAMMENI
uu '1F25 03B9' =~ /[1F95]/i
GREEK SMALL LETTER ETA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
uu '1F26 03B9' =~ /[1F96]/i
GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
uu '1F27 03B9' =~ /[1F97]/i
GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI
uu '1F20 03B9' =~ /[1F98]/i
GREEK CAPITAL LETTER ETA WITH DASIA AND PROSGEGRAMMENI
uu '1F21 03B9' =~ /[1F99]/i
GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI
uu '1F22 03B9' =~ /[1F9A]/i
GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI
uu '1F23 03B9' =~ /[1F9B]/i
GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI
uu '1F24 03B9' =~ /[1F9C]/i
GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI
uu '1F25 03B9' =~ /[1F9D]/i
GREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
uu '1F26 03B9' =~ /[1F9E]/i
GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
uu '1F27 03B9' =~ /[1F9F]/i
GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI
uu '1F60 03B9' =~ /[1FA0]/i
GREEK SMALL LETTER OMEGA WITH DASIA AND YPOGEGRAMMENI
uu '1F61 03B9' =~ /[1FA1]/i
GREEK SMALL LETTER OMEGA WITH PSILI AND VARIA AND YPOGEGRAMMENI
uu '1F62 03B9' =~ /[1FA2]/i
GREEK SMALL LETTER OMEGA WITH DASIA AND VARIA AND YPOGEGRAMMENI
uu '1F63 03B9' =~ /[1FA3]/i
GREEK SMALL LETTER OMEGA WITH PSILI AND OXIA AND YPOGEGRAMMENI
uu '1F64 03B9' =~ /[1FA4]/i
GREEK SMALL LETTER OMEGA WITH DASIA AND OXIA AND YPOGEGRAMMENI
uu '1F65 03B9' =~ /[1FA5]/i
GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
uu '1F66 03B9' =~ /[1FA6]/i
GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
uu '1F67 03B9' =~ /[1FA7]/i
GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI
uu '1F60 03B9' =~ /[1FA8]/i
GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI
uu '1F61 03B9' =~ /[1FA9]/i
GREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI
uu '1F62 03B9' =~ /[1FAA]/i
GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI
uu '1F63 03B9' =~ /[1FAB]/i
GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI
uu '1F64 03B9' =~ /[1FAC]/i
GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI
uu '1F65 03B9' =~ /[1FAD]/i
GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
uu '1F66 03B9' =~ /[1FAE]/i
GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
uu '1F67 03B9' =~ /[1FAF]/i
GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI
uu '1F70 03B9' =~ /[1FB2]/i
GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI
uu '03B1 03B9' =~ /[1FB3]/i
GREEK SMALL LETTER ALPHA WITH OXIA AND YPOGEGRAMMENI
uu '03AC 03B9' =~ /[1FB4]/i
GREEK SMALL LETTER ALPHA WITH PERISPOMENI
uu '03B1 0342' =~ /[1FB6]/i
GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENI
uu '03B1 0342 03B9' =~ /[1FB7]/i
GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
uu '03B1 03B9' =~ /[1FBC]/i
GREEK SMALL LETTER ETA WITH VARIA AND YPOGEGRAMMENI
uu '1F74 03B9' =~ /[1FC2]/i
GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI
uu '03B7 03B9' =~ /[1FC3]/i
GREEK SMALL LETTER ETA WITH OXIA AND YPOGEGRAMMENI
uu '03AE 03B9' =~ /[1FC4]/i
GREEK SMALL LETTER ETA WITH PERISPOMENI
uu '03B7 0342' =~ /[1FC6]/i
GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI
uu '03B7 0342 03B9' =~ /[1FC7]/i
GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
uu '03B7 03B9' =~ /[1FCC]/i
GREEK SMALL LETTER IOTA WITH DIALYTIKA AND VARIA
uu '03B9 0308 0300' =~ /[1FD2]/i
GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
uu '03B9 0308 0301' =~ /[1FD3]/i
GREEK SMALL LETTER IOTA WITH PERISPOMENI
uu '03B9 0342' =~ /[1FD6]/i
GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENI
uu '03B9 0308 0342' =~ /[1FD7]/i
GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND VARIA
uu '03C5 0308 0300' =~ /[1FE2]/i
GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
uu '03C5 0308 0301' =~ /[1FE3]/i
GREEK SMALL LETTER RHO WITH PSILI
uu '03C1 0313' =~ /[1FE4]/i
GREEK SMALL LETTER UPSILON WITH PERISPOMENI
uu '03C5 0342' =~ /[1FE6]/i
GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND PERISPOMENI
uu '03C5 0308 0342' =~ /[1FE7]/i
GREEK SMALL LETTER OMEGA WITH VARIA AND YPOGEGRAMMENI
uu '1F7C 03B9' =~ /[1FF2]/i
GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI
uu '03C9 03B9' =~ /[1FF3]/i
GREEK SMALL LETTER OMEGA WITH OXIA AND YPOGEGRAMMENI
uu '03CE 03B9' =~ /[1FF4]/i
GREEK SMALL LETTER OMEGA WITH PERISPOMENI
uu '03C9 0342' =~ /[1FF6]/i
GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI
uu '03C9 0342 03B9' =~ /[1FF7]/i
GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI
uu '03C9 03B9' =~ /[1FFC]/i
LATIN SMALL LIGATURE FF
lu, uu '0066 0066' =~ /[FB00]/i
LATIN SMALL LIGATURE FI
lu, uu '0066 0069' =~ /[FB01]/i
LATIN SMALL LIGATURE FL
lu, uu '0066 006C' =~ /[FB02]/i
LATIN SMALL LIGATURE FFI
lu, uu '0066 0066 0069' =~ /[FB03]/i
LATIN SMALL LIGATURE FFL
lu, uu '0066 0066 006C' =~ /[FB04]/i
LATIN SMALL LIGATURE LONG S T
lu, uu '0073 0074' =~ /[FB05]/i
LATIN SMALL LIGATURE ST
lu, uu '0073 0074' =~ /[FB06]/i
ARMENIAN SMALL LIGATURE MEN NOW
uu '0574 0576' =~ /[FB13]/i
ARMENIAN SMALL LIGATURE MEN ECH
uu '0574 0565' =~ /[FB14]/i
ARMENIAN SMALL LIGATURE MEN INI
uu '0574 056B' =~ /[FB15]/i
ARMENIAN SMALL LIGATURE VEW NOW
uu '057E 0576' =~ /[FB16]/i
ARMENIAN SMALL LIGATURE MEN XEH
uu '0574 056D' =~ /[FB17]/i
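For anyone reading without the attachment, a test of this shape could be put together roughly as follows. This is a minimal sketch, not the attached script: it reads a few inline sample lines in the CaseFolding.txt layout instead of the real file, it only covers the 'uu' case (string and pattern both upgraded to utf8), and the names parse_full_folds and class_matches are invented for the example. Which lines report ok or FAIL depends on the perl it is run with.

```perl
use strict;
use warnings;

# A few entries in the CaseFolding.txt layout: code; status; mapping; # name.
# Status 'F' marks a full (multi-character) fold.
my $sample = <<'END';
0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
01F0; F; 006A 030C; # LATIN SMALL LETTER J WITH CARON
FB01; F; 0066 0069; # LATIN SMALL LIGATURE FI
END

# Collect the full folds as { code, fold, name } records (hypothetical helper).
sub parse_full_folds {
    my ($text) = @_;
    my @folds;
    for my $line (split /\n/, $text) {
        next unless $line =~ /^([0-9A-F]+); F; ([0-9A-F ]+); # (.+)$/;
        push @folds, { code => $1, fold => $2, name => $3 };
    }
    return @folds;
}

# One 'uu' test: does the folded string match the character inside a class?
sub class_matches {
    my ($rec) = @_;
    my $string  = join '', map { chr(hex($_)) } split ' ', $rec->{fold};
    my $pattern = chr(hex($rec->{code}));
    utf8::upgrade($string);     # 'u': string stored as utf8
    utf8::upgrade($pattern);    # 'u': pattern stored as utf8
    return $string =~ /^[\Q$pattern\E]$/i ? 1 : 0;
}

for my $rec (parse_full_folds($sample)) {
    printf "%s\n  %s '%s' =~ /[%s]/i\n", $rec->{name},
        class_matches($rec) ? 'ok  ' : 'FAIL', $rec->{fold}, $rec->{code};
}
```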
So it's clear that multi-codepoint character class folding is broken
for some definition of expected behaviour.
I personally consider character class notation to be an abbreviation
of alternation. So a character class [xyz] is supposed to match the
same thing as (x|y|z). This implies that character classes have to be
able to match more than one character under case-folding rules. A lot
of external logic and at least some internal logic operates under this
assumption, so I don't think we can change it.
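To make the "class as alternation" reading concrete, here is a small sketch that expands class members into an explicit alternation of their full folds and matches it against the folded subject, so the result does not depend on how the regex engine treats /i. It assumes a perl with fc() (5.16 or later), and class_to_alternation is a made-up name; it illustrates the idea and says nothing about how regcomp.c does or should do it.

```perl
use v5.16;    # assumption: fc() is available; not true of the perls discussed

# Expand class members into an alternation of their full case folds,
# longest first so multi-character folds are tried before single characters.
sub class_to_alternation {
    my (@members) = @_;
    my %seen;
    my @alts = grep { !$seen{$_}++ } map { fc($_) } @members;
    @alts = sort { length($b) <=> length($a) || $a cmp $b } @alts;
    my $body = join '|', map { quotemeta($_) } @alts;
    return qr/^(?:$body)$/;
}

# [\x{DF}xyz] read as (?:ss|x|y|z), applied to the folded subject string.
my $re = class_to_alternation("\x{DF}", 'x', 'y', 'z');

for my $subject ('ss', 'SS', 'X', 's') {
    printf "%-2s %s\n", $subject,
        fc($subject) =~ $re ? 'matches' : 'does not match';
}
```

Under this reading 'ss' and 'SS' match the class containing ß, while a lone 's' does not.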
cheers,
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"