Re: Matching multi-character folds
From: karl williamson
Date: November 23, 2008 14:23
Subject: Re: Matching multi-character folds
Message ID: 4929D7B3.5050108@khwilliamson.com
demerphq wrote:
> 2008/11/23 karl williamson <public@khwilliamson.com>:
>> This email is best viewed under utf8.
>>
>> The Unicode standard lists several different cases where a character (or
>> code point if you prefer) should match a multiple character sequence when
>> case is ignored.
>>
>> One of these is the one oft mentioned on this list, the German lower
>> case sharp s, or ß. 'ss' =~ /ß/i is true. (U+00DF)
>
> 0xDF is the only multi-codepoint folding character in the latin-1 range.
>
> Also 0xDF is a "trickyfold" character, meaning that it can match
> something that is longer (in terms of bytes) folded than unfolded.
>
There must be more to it than that, as the code indicates there are only
three tricky fold characters, yet there are more that fit this
definition. For example, U+023A, which takes 2 bytes in UTF-8, folds to
U+2C65, which takes 3. Such characters seem to work.
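
The byte counts in that example can be checked with a short script. This is only a sketch: it assumes a perl recent enough to have fc() (5.16 or later, so not the perls under discussion here), and the helper names utf8_bytes and code_points are made up for the illustration.

```perl
use strict;
use warnings;
use feature 'fc';            # fc() is available from Perl 5.16
use Encode qw(encode);

# Number of bytes a string occupies once encoded as UTF-8.
sub utf8_bytes {
    my ($str) = @_;
    return length encode('UTF-8', $str);
}

# Code points of a string, formatted as "U+XXXX U+XXXX ...".
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
}

my $char = chr 0x023A;       # above 0xFF, so Perl stores it as UTF-8
my $fold = fc($char);        # full case fold under Unicode rules

printf "%s is %d bytes; its fold %s is %d bytes\n",
    code_points($char), utf8_bytes($char),
    code_points($fold), utf8_bytes($fold);
```

Other code points can be substituted for 0x023A to compare how many bytes their folds take.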
>> And perl does currently work that way if and only if the ß is stored in
>> utf8. For the purposes of this email, I'm assuming all strings are in utf8.
>
>
>> In a recent email, Yves has said that he thinks it is debatable whether or
>> not it should work this way. My own view is that they should match, and it
>> is beyond debate that the utf8ness of the strings should not matter. To
>> quote from the perltodo: "The handling of Unicode is unclean in many places.
>> For example, the regexp engine matches in Unicode semantics whenever the
>> string or the pattern is flagged as UTF-8, but that should not be dependent
>> on an internal storage detail of the string. Likewise, case folding
>> behaviour is dependent on the UTF8 internal flag being on or off."
>
> What do you mean by "beyond debate" here?
>
> Seems to me that there is a debate about whether unencoded
> nonlocalized strings should be treated as ascii or as latin-1, and if
> treated as latin-1 whether they should obey unicode foldcasing rules
> or not.
>
I thought that was settled. While you were taking a break from p5p, I
naively came in and started a discussion on it (there are various
threads, but most include [perl #58182] in the subject). There was
agreement that they should match Unicode and I gave a very detailed
proposal which the 5.12 pumpking said sounded reasonable. It was
pointed out that perl5100delta says:
| The handling of Unicode still is unclean in several places, where it's
| dependent on whether a string is internally flagged as UTF-8. This will
| be made more consistent in perl 5.12, but that won't be possible without
| a certain amount of backwards incompatibility.
Similarly in perltodo, as I quoted in the first email on this thread:
"that should not be dependent on an internal storage detail of the
string" meaning the utf8ness of a string should not affect its external
semantics.
It seems clear that it's been agreed that the utf8ness of a string
should not affect its external behavior. So what should the behavior
be? It has to be the Unicode behavior, for otherwise, the characters
between 128 and 255 would never behave like Unicode.
There are 3 main areas where things don't work. (I believe that the
problems with pack() have been fixed.)
1. uc(), lcfirst(), \U, etc. I have submitted for review code that
gives the same semantics for these whether or not the string is in
utf8.
2. \w, [:graph:], etc re matching. I think the solution to this is in
your RFC to make these just match ASCII or the current locale. Then the
utf8ness won't matter, except if someone's string gets converted to
utf8, and then their locale most likely won't work properly. That is
why I said in an earlier email that I don't think strings should be
upgraded to utf8 when "use locale" is in effect. The RFC also solves
the problem of, for example, \d matching things the programmer never
intended, just because the string silently, somehow, got changed to
utf8. My proposal that I thought had been accepted was, for example, to
make \w match the appropriate Latin1 characters even when not in utf8.
And I had working experimental code to do that. But I think your RFC
makes more sense.
3. caseless re matching, m/.../i. Again, perl has to change so that the
utf8ness of the pattern doesn't matter. One could do it by adding
modifiers, as you originally suggested, like /u to force unicode
semantics. But I think you had pulled away from that idea. I would be
open to something like that, but I think there has to be a way for a
programmer to make that the default, without forcing them to always
remember to add the modifier. Or one could do it by having the re code
know about latin1 semantics. Again, I have mostly working code which
doesn't change regcomp.c very much that does this. I do think overall
that this is a better solution than the modifier one. One consideration
I have that has been mentioned in the documentation is that latin1
should be faster than utf8. I think Tom may have said that he didn't
find that to be the case in his experiments.
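
To make the three areas concrete, here is a minimal sketch that runs one operation from each area on the same character, once stored as a Latin-1 byte and once upgraded to utf8. It assumes a perl with neither "use locale" nor the later unicode_strings feature in effect; probe() is a name invented for this example.

```perl
use strict;
use warnings;
# Deliberately no "use v5.12" (or later) and no "use feature 'unicode_strings'":
# either one switches on Unicode rules and hides the difference.

# Apply one operation from each of the three areas to a string.
sub probe {
    my ($str) = @_;
    return {
        uc_changes  => (uc($str) ne $str)    ? 1 : 0,   # area 1: uc()
        is_word     => ($str =~ /\A\w\z/)    ? 1 : 0,   # area 2: \w
        caseless_eq => ($str =~ /\A\xC9\z/i) ? 1 : 0,   # area 3: m/.../i
    };
}

my $native = "\xE9";            # e with acute, held as a single Latin-1 byte
my $utf8   = $native;
utf8::upgrade($utf8);           # same character, now held internally as UTF-8

my %result = (native => probe($native), utf8 => probe($utf8));

printf "strings compare equal: %s\n", $native eq $utf8 ? 'yes' : 'no';
for my $kind (qw(native utf8)) {
    printf "%-6s uc changes it: %d  \\w matches: %d  /\\xC9/i matches: %d\n",
        $kind, @{ $result{$kind} }{qw(uc_changes is_word caseless_eq)};
}
```

The two strings compare equal with eq, yet the three probes give different answers for them, which is the dependence on internal storage that the perltodo entry complains about.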
>> To start the discussion about the multi-char folds, I give examples of the
>> various types defined in the standard. The first case is that of ß.
>>
>> Another case is ligatures (they don't view ß as a ligature, and I don't
>> know why). So 'fi' =~ /ﬁ/i is true. (U+FB01)
>
> Prompted by your comment about 'ß' I did some searching for
> information on ligatures and unicode and I was surprised how little
> there was. The only ligature support seems to be for legacy conversion
> reasons (for instance latin-1 equivalency), and it seems that
> ligatures are considered to be a presentation issue better left up to
> the font and the font rendering engine. A good discussion being this:
>
> http://unicode.org/faq/ligature_digraph.html
>
> When I checked the unicode data files I didn't find anything about
> ligatures outside of certain character names including the word
> 'LIGATURE', and some comments and commentary files mentioning that
> some characters are ligatures. So I'm wondering what you were getting
> at when you said "they don't view ß as a ligature, and I don't know
> why".
>
My source for that was lib/unicore/SpecialCasing.txt
>> Another case is where there is no corresponding upper or title
>> case single precomposed character to a lower case one. For instance
>> LATIN SMALL LETTER J WITH CARON, so 'ǰ' =~ /ǰ/i is true. (U+01F0)
>>
>> Still another case is lower Greek letters with a iota-subscript or a
>> iota adscript. I won't put in an example.
>>
>> And the final cases all have to do with putting a combining dot above i
>> and j in Azeri, Turkish, and Lithuanian locales, which perl doesn't support
>> in Unicode.
>>
>> I think it is more correct for these things to match than not. However, I'm
>> not so sure when things are put in a character class. What should /[ß]/i
>> match? I'm tempted to say not 'ss' because character classes match only a
>> single character. But with the J with caron, that really is like a single
>> character, with the caron really just a modifier. For that I'm tempted to
>> say yes 'ǰ' =~ /[ǰ]/i. The problem is that the concept of a character
>> class doesn't fit with the Unicode ideas. I haven't done any research as to
>> what other languages, etc do.
>>
>> Would you like to know what happens today in perl? Well I'll tell you
>> anyway. /[ß]/i is true and 'ǰ' =~ /[ǰ]/i is false. In fact, every other
>
> I can't repeat that. In bleadperl 'ǰ' =~ /[ǰ]/i matches fine as far as
> I can tell.
>
> What doesn't work is
>
> fold('ǰ') =~ /[ǰ]/i
>
> where fold('ǰ') is equivalent to "\x{6A}\x{30C}".
>
I don't understand. I just tested again with the perl I have on my
machine, which I think is today's bleadperl, and it failed. But in any
event, as you agree below, there are a number of things broken.
>> multi-char fold returns false. This in fact may be the only time in perl
>> history, savor the moment, when the infamous ß gives an arguably more
>> correct result than other characters.
>
> Hmm. Interesting. I can't decide whether to be happy about this, or sad.
>
The only reason it works is that single-character char classes get
optimized out, and somehow it works. [ßa] doesn't work.
>> Now the code in regcomp.c takes special pains to make all these match. But
>> it doesn't work, except in the [ß] case. So we don't have to worry about
>> breaking existing code if we decide it should work differently.
>>
>> Let's look at it the other direction. Should ß =~ /ss/i ? Should 'ǰ' =~
>> /ǰ/i ? They both are true currently. However, things like ß =~ /s{2}/i is
>> false, and that seems inconsistent.
>>
>>
>> So, I'm not sure what the right answers are, but things are broken today.
>>
>
> Yes, things are. I wrote the attached hacky script to parse out
> CaseFolding.txt and test all the complex folding rules. The output is
> below. The 'll', 'lu', 'ul' and 'uu' prefixes stand for 'latin' and
> 'unicode', with the first letter giving the encoding of the string and
> the second that of the pattern. The description on the right is the
> test, with chars shown as hex code points, separated by spaces in the
> case of the folded string. The output on 5.8.9 looks different, with
> more mistakes.
>
> demerphq@gemini:~/blead/p4/lib/unicore$ ../../perl -I../../lib test_case_folding.pl
> LATIN SMALL LETTER SHARP S
> ll '0073 0073' =~ /00DF/i
> ll, ul, uu '0073 0073' =~ /[00DF]/i
> LATIN CAPITAL LETTER I WITH DOT ABOVE
> uu '0069 0307' =~ /[0130]/i
> LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
> uu '02BC 006E' =~ /[0149]/i
> LATIN SMALL LETTER J WITH CARON
> uu '006A 030C' =~ /[01F0]/i
> GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
> uu '03B9 0308 0301' =~ /[0390]/i
> GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
> uu '03C5 0308 0301' =~ /[03B0]/i
> ARMENIAN SMALL LIGATURE ECH YIWN
> uu '0565 0582' =~ /[0587]/i
> LATIN SMALL LETTER H WITH LINE BELOW
> uu '0068 0331' =~ /[1E96]/i
> LATIN SMALL LETTER T WITH DIAERESIS
> uu '0074 0308' =~ /[1E97]/i
> LATIN SMALL LETTER W WITH RING ABOVE
> uu '0077 030A' =~ /[1E98]/i
> LATIN SMALL LETTER Y WITH RING ABOVE
> uu '0079 030A' =~ /[1E99]/i
> LATIN SMALL LETTER A WITH RIGHT HALF RING
> uu '0061 02BE' =~ /[1E9A]/i
> LATIN CAPITAL LETTER SHARP S
> lu, uu '0073 0073' =~ /[1E9E]/i
> GREEK SMALL LETTER UPSILON WITH PSILI
> uu '03C5 0313' =~ /[1F50]/i
> GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA
> uu '03C5 0313 0300' =~ /[1F52]/i
> GREEK SMALL LETTER UPSILON WITH PSILI AND OXIA
> uu '03C5 0313 0301' =~ /[1F54]/i
> GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI
> uu '03C5 0313 0342' =~ /[1F56]/i
> GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
> uu '1F00 03B9' =~ /[1F80]/i
> GREEK SMALL LETTER ALPHA WITH DASIA AND YPOGEGRAMMENI
> uu '1F01 03B9' =~ /[1F81]/i
> GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI
> uu '1F02 03B9' =~ /[1F82]/i
> GREEK SMALL LETTER ALPHA WITH DASIA AND VARIA AND YPOGEGRAMMENI
> uu '1F03 03B9' =~ /[1F83]/i
> GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND YPOGEGRAMMENI
> uu '1F04 03B9' =~ /[1F84]/i
> GREEK SMALL LETTER ALPHA WITH DASIA AND OXIA AND YPOGEGRAMMENI
> uu '1F05 03B9' =~ /[1F85]/i
> GREEK SMALL LETTER ALPHA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
> uu '1F06 03B9' =~ /[1F86]/i
> GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
> uu '1F07 03B9' =~ /[1F87]/i
> GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
> uu '1F00 03B9' =~ /[1F88]/i
> GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI
> uu '1F01 03B9' =~ /[1F89]/i
> GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI
> uu '1F02 03B9' =~ /[1F8A]/i
> GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI
> uu '1F03 03B9' =~ /[1F8B]/i
> GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI
> uu '1F04 03B9' =~ /[1F8C]/i
> GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI
> uu '1F05 03B9' =~ /[1F8D]/i
> GREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
> uu '1F06 03B9' =~ /[1F8E]/i
> GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
> uu '1F07 03B9' =~ /[1F8F]/i
> GREEK SMALL LETTER ETA WITH PSILI AND YPOGEGRAMMENI
> uu '1F20 03B9' =~ /[1F90]/i
> GREEK SMALL LETTER ETA WITH DASIA AND YPOGEGRAMMENI
> uu '1F21 03B9' =~ /[1F91]/i
> GREEK SMALL LETTER ETA WITH PSILI AND VARIA AND YPOGEGRAMMENI
> uu '1F22 03B9' =~ /[1F92]/i
> GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND YPOGEGRAMMENI
> uu '1F23 03B9' =~ /[1F93]/i
> GREEK SMALL LETTER ETA WITH PSILI AND OXIA AND YPOGEGRAMMENI
> uu '1F24 03B9' =~ /[1F94]/i
> GREEK SMALL LETTER ETA WITH DASIA AND OXIA AND YPOGEGRAMMENI
> uu '1F25 03B9' =~ /[1F95]/i
> GREEK SMALL LETTER ETA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
> uu '1F26 03B9' =~ /[1F96]/i
> GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
> uu '1F27 03B9' =~ /[1F97]/i
> GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI
> uu '1F20 03B9' =~ /[1F98]/i
> GREEK CAPITAL LETTER ETA WITH DASIA AND PROSGEGRAMMENI
> uu '1F21 03B9' =~ /[1F99]/i
> GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI
> uu '1F22 03B9' =~ /[1F9A]/i
> GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI
> uu '1F23 03B9' =~ /[1F9B]/i
> GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI
> uu '1F24 03B9' =~ /[1F9C]/i
> GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI
> uu '1F25 03B9' =~ /[1F9D]/i
> GREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
> uu '1F26 03B9' =~ /[1F9E]/i
> GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
> uu '1F27 03B9' =~ /[1F9F]/i
> GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI
> uu '1F60 03B9' =~ /[1FA0]/i
> GREEK SMALL LETTER OMEGA WITH DASIA AND YPOGEGRAMMENI
> uu '1F61 03B9' =~ /[1FA1]/i
> GREEK SMALL LETTER OMEGA WITH PSILI AND VARIA AND YPOGEGRAMMENI
> uu '1F62 03B9' =~ /[1FA2]/i
> GREEK SMALL LETTER OMEGA WITH DASIA AND VARIA AND YPOGEGRAMMENI
> uu '1F63 03B9' =~ /[1FA3]/i
> GREEK SMALL LETTER OMEGA WITH PSILI AND OXIA AND YPOGEGRAMMENI
> uu '1F64 03B9' =~ /[1FA4]/i
> GREEK SMALL LETTER OMEGA WITH DASIA AND OXIA AND YPOGEGRAMMENI
> uu '1F65 03B9' =~ /[1FA5]/i
> GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
> uu '1F66 03B9' =~ /[1FA6]/i
> GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
> uu '1F67 03B9' =~ /[1FA7]/i
> GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI
> uu '1F60 03B9' =~ /[1FA8]/i
> GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI
> uu '1F61 03B9' =~ /[1FA9]/i
> GREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI
> uu '1F62 03B9' =~ /[1FAA]/i
> GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI
> uu '1F63 03B9' =~ /[1FAB]/i
> GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI
> uu '1F64 03B9' =~ /[1FAC]/i
> GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI
> uu '1F65 03B9' =~ /[1FAD]/i
> GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
> uu '1F66 03B9' =~ /[1FAE]/i
> GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
> uu '1F67 03B9' =~ /[1FAF]/i
> GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI
> uu '1F70 03B9' =~ /[1FB2]/i
> GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI
> uu '03B1 03B9' =~ /[1FB3]/i
> GREEK SMALL LETTER ALPHA WITH OXIA AND YPOGEGRAMMENI
> uu '03AC 03B9' =~ /[1FB4]/i
> GREEK SMALL LETTER ALPHA WITH PERISPOMENI
> uu '03B1 0342' =~ /[1FB6]/i
> GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENI
> uu '03B1 0342 03B9' =~ /[1FB7]/i
> GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
> uu '03B1 03B9' =~ /[1FBC]/i
> GREEK SMALL LETTER ETA WITH VARIA AND YPOGEGRAMMENI
> uu '1F74 03B9' =~ /[1FC2]/i
> GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI
> uu '03B7 03B9' =~ /[1FC3]/i
> GREEK SMALL LETTER ETA WITH OXIA AND YPOGEGRAMMENI
> uu '03AE 03B9' =~ /[1FC4]/i
> GREEK SMALL LETTER ETA WITH PERISPOMENI
> uu '03B7 0342' =~ /[1FC6]/i
> GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI
> uu '03B7 0342 03B9' =~ /[1FC7]/i
> GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
> uu '03B7 03B9' =~ /[1FCC]/i
> GREEK SMALL LETTER IOTA WITH DIALYTIKA AND VARIA
> uu '03B9 0308 0300' =~ /[1FD2]/i
> GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
> uu '03B9 0308 0301' =~ /[1FD3]/i
> GREEK SMALL LETTER IOTA WITH PERISPOMENI
> uu '03B9 0342' =~ /[1FD6]/i
> GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENI
> uu '03B9 0308 0342' =~ /[1FD7]/i
> GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND VARIA
> uu '03C5 0308 0300' =~ /[1FE2]/i
> GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
> uu '03C5 0308 0301' =~ /[1FE3]/i
> GREEK SMALL LETTER RHO WITH PSILI
> uu '03C1 0313' =~ /[1FE4]/i
> GREEK SMALL LETTER UPSILON WITH PERISPOMENI
> uu '03C5 0342' =~ /[1FE6]/i
> GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND PERISPOMENI
> uu '03C5 0308 0342' =~ /[1FE7]/i
> GREEK SMALL LETTER OMEGA WITH VARIA AND YPOGEGRAMMENI
> uu '1F7C 03B9' =~ /[1FF2]/i
> GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI
> uu '03C9 03B9' =~ /[1FF3]/i
> GREEK SMALL LETTER OMEGA WITH OXIA AND YPOGEGRAMMENI
> uu '03CE 03B9' =~ /[1FF4]/i
> GREEK SMALL LETTER OMEGA WITH PERISPOMENI
> uu '03C9 0342' =~ /[1FF6]/i
> GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI
> uu '03C9 0342 03B9' =~ /[1FF7]/i
> GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI
> uu '03C9 03B9' =~ /[1FFC]/i
> LATIN SMALL LIGATURE FF
> lu, uu '0066 0066' =~ /[FB00]/i
> LATIN SMALL LIGATURE FI
> lu, uu '0066 0069' =~ /[FB01]/i
> LATIN SMALL LIGATURE FL
> lu, uu '0066 006C' =~ /[FB02]/i
> LATIN SMALL LIGATURE FFI
> lu, uu '0066 0066 0069' =~ /[FB03]/i
> LATIN SMALL LIGATURE FFL
> lu, uu '0066 0066 006C' =~ /[FB04]/i
> LATIN SMALL LIGATURE LONG S T
> lu, uu '0073 0074' =~ /[FB05]/i
> LATIN SMALL LIGATURE ST
> lu, uu '0073 0074' =~ /[FB06]/i
> ARMENIAN SMALL LIGATURE MEN NOW
> uu '0574 0576' =~ /[FB13]/i
> ARMENIAN SMALL LIGATURE MEN ECH
> uu '0574 0565' =~ /[FB14]/i
> ARMENIAN SMALL LIGATURE MEN INI
> uu '0574 056B' =~ /[FB15]/i
> ARMENIAN SMALL LIGATURE VEW NOW
> uu '057E 0576' =~ /[FB16]/i
> ARMENIAN SMALL LIGATURE MEN XEH
> uu '0574 056D' =~ /[FB17]/i
>
What Yves didn't mention to those of you reading along is that only the
failures were printed above. When I run his program on 5.8 vs blead on
the same version of the Unicode database, the only differences I saw
were related, I think, to Yves fixing things in 5.10 with his tricky
fold addition, and to the upper case version of ß that is new in
Unicode 5.1. I don't understand off-hand why that would be different.
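
For anyone who doesn't have the attachment, the general approach can be sketched roughly as follows. This is not Yves's script: multi_char_folds() is a hypothetical helper, only the "uu" case is tried, and three sample lines in the CaseFolding.txt format stand in for the real file.

```perl
use strict;
use warnings;

# Pick out the full (status "F") folds from CaseFolding.txt-style lines.
# These are the entries whose mapping is more than one code point long.
sub multi_char_folds {
    my @folds;
    for my $line (@_) {
        next if $line =~ /^\s*(?:#|\z)/;             # comments, blank lines
        my ($data, $name) = split /\s*#\s*/, $line, 2;
        my ($code, $status, $mapping) = split /\s*;\s*/, $data;
        next unless defined $mapping && $status eq 'F';
        $name = '' unless defined $name;
        $name =~ s/\s+\z//;                          # drop trailing newline
        push @folds, {
            code => hex($code),
            fold => join('', map { chr(hex($_)) } split ' ', $mapping),
            name => $name,
        };
    }
    return @folds;
}

# Sample lines only; the real file has many more entries.
my @sample = (
    '00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S',
    '01F0; F; 006A 030C; # LATIN SMALL LETTER J WITH CARON',
    '0041; C; 0061; # LATIN CAPITAL LETTER A',
);

for my $f (multi_char_folds(@sample)) {
    my $char = chr $f->{code};
    my $str  = $f->{fold};
    utf8::upgrade($_) for $char, $str;               # string and pattern in utf8
    my $plain = $str =~ /\Q$char\E/i   ? 'ok' : 'FAIL';
    my $class = $str =~ /[\Q$char\E]/i ? 'ok' : 'FAIL';
    printf "%s\n  /%04X/i: %s  /[%04X]/i: %s\n",
        $f->{name}, $f->{code}, $plain, $f->{code}, $class;
}
```

Feeding it the lines of lib/unicore/CaseFolding.txt instead of @sample covers every full fold; which tests print FAIL depends on the perl being run.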
>
> So it's clear that multicode-point character class folding is broken
> for some definition of expected behaviour.
>
> I personally consider character class notation to be an abbreviation
> of alternation. So a character class [xyz] is supposed to match the
> same thing as (x|y|z). This implies that character classes have to be
> able to match more than one character under case-folding rules. A lot
> of external logic and at least some internal logic operates under this
> assumption, so I don't think we can change it.
>
That sounds right.
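
One way to check a given perl against that reading is to build both forms from the same members and compare them. A small sketch, with class_vs_alternation() being a name made up here; the results for the ß case will vary by perl version, so none are claimed.

```perl
use strict;
use warnings;

# Build /[...]/i and /(?:..|..)/i from the same members and report whether
# each one matches the whole target string.
sub class_vs_alternation {
    my ($target, @members) = @_;
    utf8::upgrade($_) for $target, @members;     # ask for Unicode rules
    my $class = '[' . join('', map { quotemeta($_) } @members) . ']';
    my $alt   = '(?:' . join('|', map { quotemeta($_) } @members) . ')';
    return {
        class       => ($target =~ /\A$class\z/i) ? 1 : 0,
        alternation => ($target =~ /\A$alt\z/i)   ? 1 : 0,
    };
}

# An ordinary single-character case, then the [ßa] case from this thread.
for my $case ( [ 'Y', qw(x y z) ], [ 'ss', "\x{DF}", 'a' ] ) {
    my ($target, @members) = @$case;
    my $r = class_vs_alternation($target, @members);
    printf "target of %d char(s): class=%d alternation=%d\n",
        length $target, $r->{class}, $r->{alternation};
}
```

If a character class really is shorthand for an alternation, the two columns should agree for every row.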