perl.perl5.porters | Postings from December 2009

Re: RFC: regex /i folding always use utf8?

From: karl williamson
Date: December 9, 2009 21:20
Subject: Re: RFC: regex /i folding always use utf8?
Message ID: 4B2084FD.6000203@khwilliamson.com

demerphq wrote:
> 2009/12/7 karl williamson <public@khwilliamson.com>:
>> I have been trying to resolve the discrepancies where the semantics
>> differ depending on whether or not a scalar is stored in utf8.
>>
>> To review, there are 3 major and 1 very minor known areas where this occurs.
>>  Blead already contains a fix for one of the major areas: case changing via
>> uc() and its cousins.
>>
>> I am about to submit a patch that solves it for another of the major areas:
>> regex matching (non-folded).  And I'm close to having a patch for the minor
>> area.
>>
>> If those patches are accepted, that will leave just one area, and that is
>> qr/.../i.  I think it would be a very good thing if the whole problem could
>> be solved for 5.12.
>>
>> I want to throw out for comment the possibility that this could be solved
>> trivially by always using utf8 for case insensitive matching.
>>
>> Already blead does this if the regex has a trie (although from comments in
>> the code, the need for this might stem from the inconsistent behavior, which
>> I'm fixing, so it's possible that the new patch will allow tries to not have
>> to be utf8; I'm not sure.)
> 
> The problem comes from things like \xDF matching ( [sS][sS] | \xDF )
> (there is a new one too, but it isn't a problem as it's a "high
> codepoint"). This means effectively that two tries have to be
> constructed, one for the non-unicode case, and one for the unicode
> case. A similar problem also comes up in character classes, in
> particular with logical operations like [^[:alnum:]abc] and things
> like that. Basically the idea was broken out of the box, just not
> obviously enough that it was clear that you can't hack around it. Even
> in the trie, for a long time I thought supporting both in one structure
> was doable; now I do not.
> 
>> I was working on the case folding issue earlier this year, and found problem
>> after problem, bug after bug.  Some of these are fixed by going to utf8;
>> some are not.
> 
> I can imagine.
> 
>> It's pretty clear to me that people aren't using Perl for serious Unicode
>> work with case folding, or there would be a lot more bug reports on it than
>> there are.  Before I got distracted by fixing mktables, I was coming to the
>> idea that the current scheme of things just might not ever fully work for
>> code points that have multiple character folds--that possibility just never
>> was planned for in the original algorithm.
> 
> Maybe I misunderstand you, but...
> 
> Both streams are supposed to be foldcased as a normalizer and then
> compared. The trie logic handles this properly, as far as I know. The
> matching algorithm was determined by the Unicode folks, so I don't see
> why it shouldn't work.
> 
> Or do you mean something else? Our character classes? In which case I
> can completely see your point. And Jarkko long ago suggested to me
> that we should put effort into rewriting the char-class code.
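
For readers following along, the "fold both streams, then compare" rule
quoted above can be sketched outside of Perl. This is a minimal
illustration in Python, not Perl's regex engine or its trie code; it
assumes that Python's str.casefold() is an acceptable stand-in for Unicode
full case folding, and fold_equal is a hypothetical helper name.

```python
# Illustrative sketch only: this is not how Perl's regex engine is written.
# Assumption: str.casefold() stands in for Unicode full case folding.

def fold_equal(a: str, b: str) -> bool:
    """Caseless comparison: case-fold both strings, then compare the results."""
    return a.casefold() == b.casefold()


# U+00DF LATIN SMALL LETTER SHARP S folds to the two characters "ss",
# which is why \xDF has to match ( [sS][sS] | \xDF ) under /i.
print(fold_equal("\xdf", "SS"))
print(fold_equal("\xdf", "sS"))

# A multi-character fold changes the length of the string being compared.
print(len("\xdf"), len("\xdf".casefold()))
```

The length change in the last line is the part that a character-by-character
matcher has to plan for: one input character can correspond to two
characters of pattern.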

I can't remember all the details now, and I would need to get into it
again to reconstruct them.  I should have submitted a bug report.  I hope
I've learned my lesson.

The part I remember is about char classes, and maybe that is the whole
thing.  I started writing code around it.  One issue is that, in Unicode
5.1, almost half the letters of the ASCII alphabet are the whole or a part
of the fold of some non-ASCII character.  E.g., "fi" is the fold for the
ligature U+FB01 (LATIN SMALL LIGATURE FI); "k" is the fold for U+212A
(KELVIN SIGN), etc.  When these letters are in char classes, they can get
optimized out (I don't remember the details right now, but I have code
that does) so that they just don't exist when a utf8 string comes along to
be matched.
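
A rough way to see which ASCII letters are affected is to scan every
non-ASCII code point and collect the ASCII letters that appear in its fold.
The sketch below does that in Python rather than Perl. It assumes
str.casefold() implements Unicode full case folding, and it uses whatever
Unicode version the running Python ships with, so the exact set may differ
from the Unicode 5.1 data discussed here. ascii_letters_in_folds is a
hypothetical helper name.

```python
import string
import sys


def ascii_letters_in_folds() -> set:
    """Return the ASCII lowercase letters that occur in the full case fold
    of at least one non-ASCII code point.

    Assumption: str.casefold() is used as the source of fold data, so the
    result reflects the Unicode version of the running Python.
    """
    found = set()
    for cp in range(0x80, sys.maxunicode + 1):
        folded = chr(cp).casefold()
        # Record every ASCII letter that is the whole or a part of the fold.
        found.update(ch for ch in folded if ch in string.ascii_lowercase)
    return found


letters = ascii_letters_in_folds()
print(sorted(letters))
print(len(letters), "of", len(string.ascii_lowercase))
```

Any letter in the resulting set is one that a utf8 string could match by
way of a non-ASCII character, so it cannot safely be dropped from a
case-insensitive char class.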
> 
> Cheers,
> Yves
> 
> 



