perl.perl5.porters | Postings from November 2008

Re: Matching multi-character folds

karl williamson
November 26, 2008 19:45
demerphq wrote:
> 2008/11/23 karl williamson <>:
>> demerphq wrote:
>>> 2008/11/23 karl williamson <>:
> [snip]
>>>> One of these is the oft mentioned in this list, German lower case sharp
>>>> s or ß.  'ss' =~ /ß/i is true. (U+00DF)
>>> 0xDF is the only multi-codepoint folding character in the latin-1 range.
>>> Also 0xDF is a "trickyfold" character, meaning that it can match
>>> something of longer length (in terms of bytes) folded than unfolded.
>> There must be more to it than that, as the code indicates there are only
>> three tricky fold characters, yet there are more that fit this definition.
>>  For example U+023A which takes 2 bytes in UTF-8 folds to U+2C65 which takes
>> 3.  They seem to work.
> The three trickyfold characters tickle a bug in the minlen logic of the
> optimiser. The ones you mention don't, I think because they are both
> one codepoint long. As far as the mail history shows, I don't think I
> really got to the bottom of the bug in the optimiser; I worked around it
> with the trickyfold construct as being the simplest solution.
> As far as I recall, /$char/i for a Unicode $char is stored casefolded at
> compile time.  The bug basically came down to:
> $df=chr(0xdf);
> utf8::upgrade($df);
> print $df=~/$df/i ? "ok" : "not ok";
> which, if inspected under use re 'debug', revealed that this was
> internally converted into an EXACTF <ss> opcode.  This in turn caused
> the minlen logic to fire, as it is two characters long.  An exhaustive
> search for these revealed problems only in the three codepoints we
> covered, and my retest shows that we have more of this class with the
> updates to Unicode 5.1.  Exactly why the others did not fail was never
> really clear.  I did an exhaustive search and those were the ones I
> found.  The optimiser is a scary beast :-(
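For readers following along, the snippet quoted above runs as-is; here it is as a minimal self-contained sketch (on an affected bleadperl the match failed; on a fixed perl it succeeds):

```perl
use strict;
use warnings;

# The case from the quoted discussion: the /i pattern containing the
# single character U+00DF is stored casefolded at compile time, producing
# an EXACTF <ss> node whose two-character length trips the minlen logic.
my $df = chr(0xdf);
utf8::upgrade($df);

print $df =~ /$df/i ? "ok\n" : "not ok\n";
```

Running the same match under `use re 'debug'` (for example `perl -Mre=debug -e '...'`) dumps the compiled program, which is how the EXACTF <ss> conversion was spotted.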
> [snip]
>>> What do you mean by "beyond debate" here?
>>> Seems to me that there is a debate about whether unencoded,
>>> nonlocalized strings should be treated as ASCII or as Latin-1, and if
>>> treated as Latin-1, whether they should obey Unicode foldcasing rules
>>> or not.
>> I thought that was settled.  While you were taking a break from p5p, I
>> naively came in and started a discussion on it (there are various threads,
>> but most include [perl #58182] in the subject).  There was agreement that
>> they should match Unicode and I gave a very detailed proposal which the 5.12
>> pumpking said sounded reasonable.  It was pointed out that perl5100delta
>> says:
>> | The handling of Unicode still is unclean in several places, where it's
>> | dependent on whether a string is internally flagged as UTF-8. This will
>> | be made more consistent in perl 5.12, but that won't be possible without
>> | a certain amount of backwards incompatibility.
>> Similarly in perltodo, as I quoted in the first email on this thread:
>> "that should not be dependent on an internal storage detail of the string"
>> meaning the utf8ness of a string should not affect its external semantics.
>> It seems clear that it's been agreed that the utf8ness of a string should
>> not affect its external behavior.  So what should the behavior be?  It has
>> to be the Unicode behavior, for otherwise, the characters between 128 and
>> 255 would never behave like Unicode.
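The dependence on internal encoding is easy to demonstrate. A minimal sketch (under the old default semantics the two matches disagree; `use feature 'unicode_strings'` in later perls removes the difference):

```perl
use strict;
use warnings;

# Two strings holding the identical character U+00E9 (e-acute),
# differing only in internal representation.
my $bytes = "\xE9";          # stored as native bytes
my $utf8  = "\xE9";
utf8::upgrade($utf8);        # same character, internally UTF-8

# With the encoding-dependent semantics, \w recognizes the Latin-1
# letter only when the string happens to be internally UTF-8.
print "bytes: ", $bytes =~ /\w/ ? "word" : "not word", "\n";
print "utf8: ",  $utf8  =~ /\w/ ? "word" : "not word", "\n";
```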
>> There are 3 main areas where things don't work.  (I believe that the
>> problems with pack() have been fixed.)
>> 1. uc(), lcfirst(), \U, etc.  I have submitted for review code that gives
>> the same semantics for these whether or not the string is in utf8.
> This worries me, as it involves a fairly serious behaviour change. But
> if it's been decided then fine; at least there will be consistency.
And since it is such a change, it will require a pragma to enable it in 
5.10, becoming the default in 5.12.
>> 2. \w, [:graph:], etc re matching.  I think the solution to this is in your
>> RFC to make these just match ASCII or the current locale.  Then the utf8ness
>> won't matter, except if someone's string gets converted to utf8, and then
>> their locale most likely won't work properly.  That is why I said in an
>> earlier email that I don't think strings should be upgraded to utf8 when
>> "use locale" is in effect.  The RFC also solves the problem of, for example,
>> \d matching things the programmer never intended, just because the string
>> silently, somehow, got changed to utf8.  My proposal that I thought had been
>> accepted was, for example, to make \w match the appropriate Latin1
>> characters even when not in utf8. And I had working experimental code to do
>> that.  But I think your RFC makes more sense.
> Ok.
>> 3. caseless re matching m/.../i.  Again, perl has to change so that the
>> utf8ness of the pattern doesn't matter.  One could do it by adding
>> modifiers, as you originally suggested, like /u to force unicode semantics.
>>  But I think you had pulled away from that idea.  I would be open to
>> something like that, but I think there has to be a way for a programmer to
>> make that the default, without forcing them to always remember to add the
>> modifier.  Or one could do it by having the re code know about latin1
>> semantics.  Again, I have mostly working code which doesn't change regcomp.c
>> very much that does this.  I do think overall that this is a better
>> solution than the modifier one.  One consideration I have that has been
>> mentioned in the documentation is that latin1 should be faster than utf8.  I
>> think Tom may have said that he didn't find that to be the case in his
>> experiments.
> I'd like to see more on this. I do know that benchmarking the regex
> engine is not easy. There are lots of special cases and things like
> that to consider. I've definitely seen utf8 have serious performance
> consequences.
The goal should be that a programmer doesn't have to know about the 
internal storage method of a string.  From looking at the code, I don't 
see how going to utf8 could possibly not have a significant impact.  In 
a program I wrote, I looked at the documentation and bent over backwards 
to keep from going outside the Latin1 range, so as not to invoke utf8. 
Then I discovered that Encode always goes to utf8, so my efforts were 
for naught.
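As a rough starting point for the benchmarking question, a sketch with the core Benchmark module (the haystack, pattern, and iteration count here are arbitrary choices, and a serious comparison would have to cover the engine's many special cases):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Identical haystacks, one kept as native bytes and one upgraded to the
# internal UTF-8 representation.
my $bytes = ("abc" x 1_000) . "xyz";
my $utf8  = $bytes;
utf8::upgrade($utf8);

# Compare match speed against the two representations.
cmpthese(20_000, {
    bytes => sub { $bytes =~ /xyz/ },
    utf8  => sub { $utf8  =~ /xyz/ },
});
```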
> [snip]
>>>> Another case is ligatures (they don't view ß as a ligature, and I don't
>>>> know why)  So 'fi' =~ /fi/i is true. (U+FB01)
>>> Prompted by your comment about 'ß' I did some searching for
>>> information on ligatures and unicode and I was surprised how little
>>> there was. The only ligature support seems to be for legacy conversion
>>> reasons (for instance latin-1 equivalency), and it seems that
>>> ligatures are considered to be a presentation issue better left up to
>>> the font and the font rendering engine.  A good discussion being this:
>>> When I checked the unicode data files I didn't find anything about
>>> ligatures outside of certain character names including the word
>>> 'LIGATURE', and some comments and commentary files mentioning that
>>> some characters are ligatures. So I'm wondering what you were getting
>>> at when you said "they don't view ß as a ligature, and I don't know
>>> why".
>> My source for that was lib/unicore/SpecialCasing.txt
> Right, which includes a comment about some of the unusual forms. But
> it is not a formal status or property of the characters.
> [snip]
>>>> Would you like to know what happens today in perl?  Well I'll tell you
>>>> anyway.  'ss' =~ /[ß]/i is true and 'ǰ' =~ /[ǰ]/i is false.  In fact, every
>>>> other
>>> I can't repeat that. In bleadperl 'ǰ' =~ /[ǰ]/i matches fine as far as
>>> I can tell.
>>> What doesn't work is
>>> fold('ǰ') =~ /[ǰ]/i
>>> where fold('ǰ') is equivalent to "\x{6A}\x{30C}".
>> I don't understand.  I just tested again with the perl I have on my machine
>> that I think is today's bleadperl, and it failed.  But in any event as you
>> agree below, there are a number of things broken.
> Can you post a one-liner that doesn't contain Unicode in it to test
> with? In other words, coded so it can be expressed in ASCII, whatever
> the code itself does?
Actually, when I look at your test cases, I see it is one that failed:
         uu                      '006A 030C' =~ /[01F0]/i
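In the ASCII-only form asked for above, that failing test case can be expressed as (a sketch; whether it prints "ok" depends on the perl in question):

```perl
use strict;
use warnings;

# U+01F0 (LATIN SMALL LETTER J WITH CARON) case-folds to the two
# codepoints U+006A U+030C ('j' plus COMBINING CARON).  The question is
# whether the bracketed class matches the folded form under /i.
my $folded = "\x{6A}\x{30C}";
print $folded =~ /^[\x{1F0}]$/i ? "ok\n" : "not ok\n";
```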
>>>> multi-char fold returns false.  This in fact may be the only time in perl
>>>> history, savor the moment, when the infamous ß gives an arguably more
>>>> correct result than other characters.
>>> Hmm. Interesting. I can't decide whether to be happy about this, or sad.
>> The only reason it works is that single-character char classes get
>> optimized out, and somehow it works.  [ßa] doesn't work.
> Ah. Sigh. So they turn into EXACTF instead of ANYOF. I forgot about that.
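That difference can be seen directly; a sketch (the results vary by perl version, so neither line's outcome is asserted here):

```perl
use strict;
use warnings;

# A one-element class can be optimized into an EXACTF node, which knows
# about the ss fold; a two-element class stays a genuine ANYOF node.
print "single: ", "ss" =~ /^[\x{DF}]$/i  ? "match" : "no match", "\n";
print "multi: ",  "ss" =~ /^[\x{DF}a]$/i ? "match" : "no match", "\n";
```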
>>>> Now the code in regcomp.c takes special pains to make all these match.
>>>> But it doesn't work, except in the [ß] case.  So we don't have to worry
>>>> about breaking existing code if we decide it should work differently.
>>>> Let's look at it the other direction.  Should ß =~ /ss/i ?  Should
>>>> 'ǰ' =~ /ǰ/i ?  They both are true currently.  However, something like
>>>> ß =~ /s{2}/i is false, and that seems inconsistent.
>>>> So, I'm not sure what the right answers are, but things are broken today.
>>> Yes, things are.  I wrote the attached hacky script to parse out
>>> CaseFolding.txt and test all the complex folding rules.  The output is
>>> below; 'll', 'lu', 'ul', 'uu' mean 'latin' and 'unicode', with the
>>> first letter giving the string's encoding and the second the
>>> pattern's.  The description on the right is the test, with chars
>>> represented by their hex values, separated by spaces in the case of
>>> the folded string.  The output on 5.8.9 looks different, with more
>>> mistakes.
>>> demerphq@gemini:~/blead/p4/lib/unicore$ ../../perl -I../../lib
>>>        ll                      '0073 0073' =~ /00DF/i
>>>        ll, ul, uu              '0073 0073' =~ /[00DF]/i
> ll is expected to fail here under the current rules.
> [snip]
>>>        lu, uu                  '0073 0073' =~ /[1E9E]/i
> lu probably fails because of the minlen bug.
> [snip]
>>>        lu, uu                  '0066 0066' =~ /[FB00]/i
>>>        lu, uu                  '0066 0069' =~ /[FB01]/i
>>>        lu, uu                  '0066 006C' =~ /[FB02]/i
>>>        lu, uu                  '0066 0066 0069' =~ /[FB03]/i
>>>        lu, uu                  '0066 0066 006C' =~ /[FB04]/i
>>>        lu, uu                  '0073 0074' =~ /[FB05]/i
>>>        lu, uu                  '0073 0074' =~ /[FB06]/i
> These lu's might fail because of the minlen bug. Are these new to 5.1?
These latin ligatures were in Unicode 1.1.  There's some code in 
regclass() in regcomp.c, for EBCDIC only, that looks bogus to me and 
attempts to handle some of these.  I don't understand why just some 
would need special handling.
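Since the script itself isn't shown here, a minimal sketch of the kind of parsing it would need: CaseFolding.txt marks full (multi-character) foldings with status 'F'. A few real lines from the file are inlined below for self-containment; in practice one would read lib/unicore/CaseFolding.txt itself.

```perl
use strict;
use warnings;

# Extract the full (status 'F') case foldings, i.e. the multi-character
# folds under discussion in this thread.
while (my $line = <DATA>) {
    next if $line =~ /^\s*(?:#|$)/;              # skip comments, blanks
    my ($code, $status, $mapping) = split /\s*;\s*/, $line;
    next unless $status eq 'F';                  # full foldings only
    print "$code folds to $mapping\n";
}

__DATA__
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
01F0; F; 006A 030C; # LATIN SMALL LETTER J WITH CARON
FB01; F; 0066 0069; # LATIN SMALL LIGATURE FI
```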

>> What Yves didn't mention to those of you reading along, is that only the
>> failures were printed above.
> Yes, correct, and we only test the possible combinations. So only \xDF
> has 'll' or 'ul', and most only have 'uu'.
>> When I run his program on 5.8 vs blead on the
>> same version of the Unicode database, the only differences I saw were
>> related, I think, to Yves fixing things in 5.10 with his tricky fold
>> addition, and the upper case version of ß that is new in Unicode 5.1.  I don't
>> understand off-hand why that would be different.
> Because it's not being handled by the trickyfold logic. Basically it's
> the same problem as the lower case, but it hasn't been added.
> And none of the special cases coded into the regex
> engine to deal with 0xDF have been added to the engine for its
> majestic brother.
>>> So it's clear that multi-codepoint character class folding is broken
>>> for some definition of expected behaviour.
>>> I personally consider character class notation to be an abbreviation
>>> of alternation. So a character class [xyz] is supposed to match the
>>> same thing as (x|y|z).  This implies that character classes have to be
>>> able to match more than one character under case-folding rules.  A lot
>>> of external logic and at least some internal logic operates under this
>>> assumption, so I don't think we can change it.
>> That sounds right.
> I'm trying to imagine a way to do this that doesn't involve a pretty
> considerable redesign of how character classes work, and I'm not coming
> up with much.
> Yves
Keep in mind that it works for the vast majority of Unicode characters, 
and fails only on a few, and only when there is a multi-character fold. 
However, we don't even attempt to implement some things that Unicode 
would want us to, such as treating two strings that are in different 
canonical normalizations as equivalent.
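For concreteness, the normalization equivalence in question, using the core Unicode::Normalize module:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# U+00E9 and "e" + U+0301 are canonically equivalent strings, but eq
# (and the regex engine) compare codepoints, not equivalence classes.
my $composed   = "\x{E9}";       # e-acute as a single codepoint
my $decomposed = "e\x{301}";     # 'e' plus COMBINING ACUTE ACCENT

print $composed eq $decomposed           ? "raw: eq\n" : "raw: ne\n";
print NFC($composed) eq NFC($decomposed) ? "nfc: eq\n" : "nfc: ne\n";
```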
