perl.perl5.porters | Postings from November 2008

Re: Matching multi-character folds

From: demerphq
Date: November 24, 2008 10:31
Subject: Re: Matching multi-character folds
Message ID: 9b18b3110811241030r6a71f0b2m295fe8b5f2e6b4c1@mail.gmail.com

2008/11/23 karl williamson <public@khwilliamson.com>:
> demerphq wrote:
>> 2008/11/23 karl williamson <public@khwilliamson.com>:
[snip]
>>> One of these is the one oft mentioned on this list, the German lower case
>>> sharp s, or ß.  'ss' =~ /ß/i is true. (U+00DF)
>>
>> 0xDF is the only multi-codepoint folding character in the latin-1 range.
>>
>> Also 0xDF is a "trickyfold" character, meaning that it can match
>> something of longer length (in terms of bytes) folded than unfolded.
>>
> There must be more to it than that, as the code indicates there are only
> three tricky fold characters, yet there are more that fit this definition.
>  For example U+023A which takes 2 bytes in UTF-8 folds to U+2C65 which takes
> 3.  They seem to work.

The three trickyfold characters tickle a bug in the minlen logic of the
optimiser. The ones you mention don't, I think because both the character
and its fold are one codepoint long. As far as the mail history shows, I
don't think I ever really got to the bottom of the bug in the optimiser;
I worked around it with the trickyfold construct as the simplest solution.
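The distinction between "longer in bytes" and "longer in codepoints" can be checked directly. What follows is a minimal sketch, not code from this thread: it assumes Perl 5.16 or later (for fc() and the unicode_strings feature, neither of which existed in 2008), the helper name fold_info is made up for illustration, and byte lengths are measured on the UTF-8 encoding.

```perl
use v5.16;      # assumption: enables fc() and unicode_strings
use warnings;
use Encode qw(encode);

# Hypothetical helper: report how a codepoint's full case fold compares
# to the original character, in UTF-8 bytes and in codepoints.
sub fold_info {
    my ($cp) = @_;
    my $char = chr $cp;
    my $fold = fc($char);
    return {
        char_bytes => length(encode('UTF-8', $char)),
        fold_bytes => length(encode('UTF-8', $fold)),
        fold_chars => length($fold),
    };
}

for my $cp (0x023A, 0x00DF) {
    my $info = fold_info($cp);
    printf "U+%04X: %d UTF-8 byte(s) unfolded, %d folded, fold is %d codepoint(s)\n",
        $cp, @{$info}{qw(char_bytes fold_bytes fold_chars)};
}
```

U+023A grows in bytes but stays a single codepoint, while the fold of U+00DF is two codepoints, which is the property the minlen logic cares about.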

As far as I recall /$char/i for unicode $char is stored casefolded at
compile time.  The bug basically came down to:

my $df = chr(0xdf);      # LATIN SMALL LETTER SHARP S
utf8::upgrade($df);      # force the internal utf8 representation
print $df =~ /$df/i ? "ok" : "not ok";

which, if inspected under use re 'debug', revealed that this was
internally converted into an EXACTF <ss> opcode. That in turn caused
the minlen logic to fire, as the folded string is two characters long.
An exhaustive search for these revealed problems only in the three
codepoints we covered, and my retest shows that we have more of this
class with the updates to Unicode 5.1.  Exactly why the others did not
fail was never really clear. The optimiser is a scary beast :-(
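For anyone who wants to look at the compiled program themselves, a minimal sketch of the inspection step follows. It only assumes the standard re pragma; the dump goes to STDERR, and the opcode names it shows depend on the perl version, so no particular output is promised here.

```perl
use strict;
use warnings;

my $df = chr(0xDF);
utf8::upgrade($df);          # same character, utf8 internal storage

# Compiling inside the scope of "use re 'debug'" dumps the regex program
# (opcodes, minlen, optimiser notes) to STDERR.
my $re = do {
    use re 'debug';
    qr/$df/i;
};

print $df =~ $re ? "ok" : "not ok", "\n";
```

In the dump, the node list shows which opcode the pattern became, and the "minlen" figure shows what the optimiser believes the shortest possible match is.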

[snip]
>> What do you mean by "beyond debate" here?
>>
>> Seems to me that there is a debate about whether unencoded
>> nonlocalized strings should be treated as ascii or as latin-1, and if
>> treated as latin-1 whether they should obey unicode foldcasing rules
>> or not.
>>
> I thought that was settled.  While you were taking a break from p5p, I
> naively came in and started a discussion on it (there are various threads,
> but most include [perl #58182] in the subject).  There was agreement that
> they should match Unicode and I gave a very detailed proposal which the 5.12
> pumpking said sounded reasonable.  It was pointed out that perl5100delta
> says:
>
> | The handling of Unicode still is unclean in several places, where it's
> | dependent on whether a string is internally flagged as UTF-8. This will
> | be made more consistent in perl 5.12, but that won't be possible without
> | a certain amount of backwards incompatibility.
>
> Similarly in perltodo, as I quoted in the first email on this thread:
> "that should not be dependent on an internal storage detail of the string"
> meaning the utf8ness of a string should not affect its external semantics.
>
> It seems clear that it's been agreed that the utf8ness of a string should
> not affect its external behavior.  So what should the behavior be?  It has
> to be the Unicode behavior, for otherwise, the characters between 128 and
> 255 would never behave like Unicode.
>
> There are 3 main areas where things don't work.  (I believe that the
> problems with pack() have been fixed.)
>
> 1. uc(), lcfirst(), \U, etc.  I have submitted for review code that gives
> the same semantics for these whether or not the string is in utf8.

This worries me, as it involves a fairly serious behaviour change. But
if it's been decided then fine; at least it will be consistent.

> 2. \w, [:graph:], etc re matching.  I think the solution to this is in your
> RFC to make these just match ASCII or the current locale.  Then the utf8ness
> won't matter, except if someone's string gets converted to utf8, and then
> their locale most likely won't work properly.  That is why I said in an
> earlier email that I don't think strings should be upgraded to utf8 when
> "use locale" is in effect.  The RFC also solves the problem of, for example,
> \d matching things the programmer never intended, just because the string
> silently, somehow, got changed to utf8.  My proposal that I thought had been
> accepted was, for example, to make \w match the appropriate Latin1
> characters even when not in utf8. And I had working experimental code to do
> that.  But I think your RFC makes more sense.

Ok.

> 3. caseless re matching m/.../i  Again, perl has to change so that the
> utf8ness of the pattern doesn't matter.  One could do it by adding
> modifiers, as you originally suggested, like /u to force unicode semantics.
>  But I think you had pulled away from that idea.  I would be open to
> something like that, but I think there has to be a way for a programmer to
> make that the default, without forcing them to always remember to add the
> modifier.  Or one could do it by having the re code know about latin1
> semantics.  Again, I have mostly working code which doesn't change regcomp.c
> very much that does this.  I  do think overall that this is a better
> solution than the modifier one.  One consideration I have that has been
> mentioned in the documentation is that latin1 should be faster than utf8.  I
> think Tom may have said that he didn't find that to be the case in his
> experiments.

I'd like to see more on this. I do know that benchmarking the regex
engine is not easy. There are lots of special cases and things like
that to consider. I've definitely seen utf8 have serious performance
consequences.
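As a starting point for such a comparison, here is a rough sketch using the core Benchmark module. The sample text, the pattern and the helper name count_matches are arbitrary choices made for illustration, and a single pattern like this says nothing about the many special cases in the engine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Same characters twice: once in native (latin-1) storage, once upgraded.
my $latin1 = "abc \xE9t\xE9 xyz " x 1000;
my $utf8   = $latin1;
utf8::upgrade($utf8);

my $re = qr/\xE9T\xE9\s+XYZ/i;

# Count all caseless matches of $re in the given string.
sub count_matches {
    my ($str) = @_;
    my $n = () = $str =~ /$re/g;
    return $n;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    latin1 => sub { count_matches($latin1) },
    utf8   => sub { count_matches($utf8) },
});
```

Both variants should find the same number of matches; only the timings in the table are of interest, and they will vary with the pattern, the data and the perl version.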

[snip]
>>> Another case is ligatures (they don't view ß as a ligature, and I don't
>>> know why).  So 'fi' =~ /ﬁ/i is true. (U+FB01)
>>
>> Prompted by your comment about 'ß' I did some searching for
>> information on ligatures and unicode and I was surprised how little
>> there was. The only ligature support seems to be for legacy conversion
>> reasons (for instance latin-1 equivalency), and it seems that
>> ligatures are considered to be a presentation issue better left up to
>> the font and the font rendering engine.  A good discussion being this:
>>
>> http://unicode.org/faq/ligature_digraph.html
>>
>> When I checked the unicode data files I didn't find anything about
>> ligatures outside of certain character names including the word
>> 'LIGATURE', and some comments and commentary files mentioning that
>> some characters are ligatures. So I'm wondering what you were getting
>> at when you said "they don't view ß as a ligature, and I don't know
>> why".
>>
> My source for that was lib/unicore/SpecialCasing.txt

Right, which includes a comment about some of the unusual forms. But
it is not a formal status or property of the characters.

[snip]
>>> Would you like to know what happens today in perl?  Well I'll tell you
>>> anyway.  /[ß]/i is true and 'ǰ' =~ /[ǰ]/i is false.  In fact, every other
>>
>> I can't repeat that. In bleadperl 'ǰ' =~ /[ǰ]/i matches fine as far as
>> I can tell.
>>
>> What doesn't work is
>>
>> fold('ǰ') =~ /[ǰ]/i
>>
>> where fold('ǰ') is equivalent to "\x{6A}\x{30C}".
>>
>
> I don't understand.  I just tested again with the perl I have on my machine
> that I think is today's bleadperl, and it failed.  But in any event as you
> agree below, there are a number of things broken.

Can you post a one-liner that doesn't contain Unicode in it to test
with? In other words, coded so it can be expressed in ASCII, whatever
the code itself does?
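A test of this kind can be written with \x{...} escapes so that the source stays pure ASCII. The following is only a sketch of the idea, not something posted in the thread, and the results it prints depend on the perl being tested.

```perl
use strict;
use warnings;

my $char   = "\x{1F0}";         # LATIN SMALL LETTER J WITH CARON
my $folded = "\x{6A}\x{30C}";   # its full case fold: j + COMBINING CARON

# Test both the character itself and its fold against the same class.
my %result = (
    'char   =~ /[char]/i' => ($char   =~ /[\x{1F0}]/i ? 1 : 0),
    'folded =~ /[char]/i' => ($folded =~ /[\x{1F0}]/i ? 1 : 0),
);

printf "%-20s %s\n", $_, $result{$_} ? 'match' : 'no match'
    for sort keys %result;
```

The same check fits on a command line as:

    perl -le 'print "\x{6A}\x{30C}" =~ /[\x{1F0}]/i ? "ok" : "not ok"'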

>>> multi-char fold returns false.  This in fact may be the only time in perl
>>> history, savor the moment, when the infamous ß gives an arguably more
>>> correct result than other characters.
>>
>> Hmm. Interesting. I can't decide whether to be happy about this, or sad.
>>
> The only reason it works is because for single character char classes, they
> get optimized out, and somehow, it works.  [ßa] doesn't work.

Ah. Sigh. So they turn into EXACTF instead of ANYOF. I forgot about that.

>
>>> Now the code in regcomp.c takes special pains to make all these match.
>>> But it doesn't work, except in the [ß] case.  So we don't have to worry
>>> about breaking existing code if we decide it should work differently.
>>>
>>> Let's look at it the other direction.  Should ß =~ /ss/i ?  Should 'ǰ' =~
>>> /ǰ/i ?  They both are true currently.  However, things like ß =~ /s{2}/i
>>> is false, and that seems inconsistent.
>>>
>>>
>>> So, I'm not sure what the right answers are, but things are broken today.
>>>
>>
>> Yes, things are.  I wrote the attached hacky script to parse out
>> CaseFolding.txt and test all the complex folding rules. The output is
>> below. In the labels 'll', 'lu', 'ul' and 'uu', 'l' stands for 'latin'
>> and 'u' for 'unicode', with the first letter giving the string's
>> encoding and the second the pattern's. The description on the right is
>> the test, with chars
>> represented by their hex representation, and separated by spaces in
>> the case of the folded string.  The output on 5.8.9 looks different,
>> with more mistakes.
>>
>> demerphq@gemini:~/blead/p4/lib/unicore$ ../../perl -I../../lib test_case_folding.pl
>> LATIN SMALL LETTER SHARP S
>>        ll                      '0073 0073' =~ /00DF/i
>>        ll, ul, uu              '0073 0073' =~ /[00DF]/i

ll is expected to fail here under the current rules.

[snip]
>> LATIN CAPITAL LETTER SHARP S
>>        lu, uu                  '0073 0073' =~ /[1E9E]/i

lu probably fails because of the minlen bug.

[snip]
>> LATIN SMALL LIGATURE FF
>>        lu, uu                  '0066 0066' =~ /[FB00]/i
>> LATIN SMALL LIGATURE FI
>>        lu, uu                  '0066 0069' =~ /[FB01]/i
>> LATIN SMALL LIGATURE FL
>>        lu, uu                  '0066 006C' =~ /[FB02]/i
>> LATIN SMALL LIGATURE FFI
>>        lu, uu                  '0066 0066 0069' =~ /[FB03]/i
>> LATIN SMALL LIGATURE FFL
>>        lu, uu                  '0066 0066 006C' =~ /[FB04]/i
>> LATIN SMALL LIGATURE LONG S T
>>        lu, uu                  '0073 0074' =~ /[FB05]/i
>> LATIN SMALL LIGATURE ST
>>        lu, uu                  '0073 0074' =~ /[FB06]/i

These lu's might fail because of the minlen bug. Are these new in Unicode 5.1?

> What Yves didn't mention to those of you reading along, is that only the
> failures were printed above.

Yes, correct, and we only test the possible combinations. So only \xDF
has 'll' or 'ul', and most only have 'uu'.
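To make "possible combinations" concrete, here is a small sketch of how the pairs can be enumerated. It is not the attached test_case_folding.pl: the two sample lines are hand-copied in the CaseFolding.txt "F" format, and the helper names (encodings_for, in_encoding, combos) are invented for this illustration.

```perl
use strict;
use warnings;

# Hand-copied sample lines in the CaseFolding.txt "F" (full fold) format.
my @lines = (
    '00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S',
    'FB01; F; 0066 0069; # LATIN SMALL LIGATURE FI',
);

# A string can be stored as latin-1 ('l') only if every char is < 256;
# any string can be stored as utf8 ('u').
sub encodings_for {
    my ($str) = @_;
    return $str =~ /[^\x00-\xFF]/ ? ('u') : ('l', 'u');
}

# Return a copy of $str in the requested internal representation.
sub in_encoding {
    my ($str, $enc) = @_;
    my $copy = $str;
    if ($enc eq 'u') { utf8::upgrade($copy) } else { utf8::downgrade($copy) }
    return $copy;
}

# All (string, pattern) encoding pairs that can be tested for one fold.
sub combos {
    my ($char, $folded) = @_;
    my @combos;
    for my $s (encodings_for($folded)) {
        for my $p (encodings_for($char)) {
            push @combos, "$s$p";
        }
    }
    return @combos;
}

for my $line (@lines) {
    my ($code, $status, $mapping, $name) =
        $line =~ /^([0-9A-F]+); (\w); ([0-9A-F ]+); # (.*)$/ or next;
    my $char   = chr(hex($code));
    my $folded = join '', map { chr(hex($_)) } split ' ', $mapping;

    for my $combo (combos($char, $folded)) {
        my ($s_enc, $p_enc) = split //, $combo;
        my $str = in_encoding($folded, $s_enc);
        my $pat = in_encoding($char,   $p_enc);
        my $ok  = $str =~ /[${pat}]/i;
        printf "%-28s %s  %s\n", $name, $combo, $ok ? 'ok' : 'FAILED';
    }
}
```

Because U+FB01 cannot be stored as latin-1, its pattern can only be 'u', which is why the ligature entries can only ever show 'lu' and 'uu'.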

> When I run his program on 5.8 vs blead on the
> same version of the Unicode database, the only differences I saw were
> related, I think, to Yves fixing things in 5.10 with his tricky fold
> addition, and the new in Unicode 5.1 upper case version of ß.  I don't
> understand off-hand why that would be different.

Because it's not being handled by the trickyfold logic. Basically it's
the same problem as the lower case, but it hasn't been added to
regcharclass.pl. And none of the special cases coded into the regex
engine to deal with 0xDF have been added to the engine for its
majestic brother.

>> So it's clear that multi-codepoint character class folding is broken
>> for some definition of expected behaviour.
>>
>> I personally consider character class notation to be an abbreviation
>> of alternation. So a character class [xyz] is supposed to match the
>> same thing as (x|y|z).  This implies that character classes have to be
>> able to match more than one character under case-folding rules.  A lot
>> of external logic and at least some internal logic operates under this
>> assumption, so I don't think we can change it.
>>
>
> That sounds right.

I'm trying to imagine a way to do this that doesn't involve a pretty
considerable redesign of how character classes work, and not coming up
with much.
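To illustrate the "class as abbreviation of alternation" reading at the user level only, here is a hypothetical sketch that expands class members into an alternation of their full case folds. It is not how regcomp.c works and not a proposal for the engine; it assumes Perl 5.16+ for fc(), and the helper name class_as_alternation is made up.

```perl
use v5.16;      # assumption: Perl 5.16+ for fc() and unicode_strings
use warnings;

# Build /(?:fold1|fold2|...)/i from a list of class members, so that a
# member with a multi-character fold becomes a multi-character branch.
sub class_as_alternation {
    my @members = @_;
    my %seen;
    my @folds = sort { length($b) <=> length($a) or $a cmp $b }
                grep { !$seen{$_}++ }
                map  { fc($_) } @members;      # longest folds first
    my $alternation = join '|', map { quotemeta($_) } @folds;
    return qr/(?:$alternation)/i;
}

my $re = class_as_alternation("\xDF", 'a');    # stands in for /[ßa]/i

for my $str ('ss', 'SS', 'a', 'x') {
    printf "%-2s %s\n", $str,
        $str =~ /\A$re\z/ ? 'matches' : 'does not match';
}
```

This sidesteps the one-character assumption of a class, at the cost of losing ranges, negation and everything else that makes a character class cheap to match.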

Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
