demerphq wrote:
> 2009/12/7 karl williamson <public@khwilliamson.com>:
>> I have been trying to solve the discrepancies involving the semantics being
>> different when a scalar is stored in utf8 or not.
>>
>> To review, there are 3 major and 1 very minor known areas where this occurs.
>> Blead already contains a fix for one of the major areas: case changing via
>> uc() and its cousins.
>>
>> I am about to submit a patch that solves it for another of the major areas:
>> regex matching (non-folded). And I'm close to having a patch for the minor
>> area.
>>
>> If those patches are accepted, it will leave just one area left, and that is
>> qr/.../i. I think it would be a very good thing if the whole problem could
>> be solved for 5.12.
>>
>> I want to throw out for comment the possibility that this could be solved
>> trivially by always using utf8 for case insensitive matching.
>>
>> Already blead does this if the regex has a trie (although from comments in
>> the code, the need for this might stem from the inconsistent behavior, which
>> I'm fixing, so it's possible that the new patch will allow tries to not have
>> to be utf8; I'm not sure.)
>
> The problem comes from things like \xDF matching ( [sS][sS] | \xDF )
> (there is a new one too, but it isn't a problem as it's a "high
> codepoint"). This means effectively that two tries have to be
> constructed, one for the non-unicode case, and one for the unicode
> case. A similar problem also comes up in character classes, in
> particular with logical operations like [^[:alnum:]abc] and things
> like that. Basically the idea was broken out of the box, just not
> obviously enough that it was clear that you can't hack around it. Even
> in the trie, for a long time I thought supporting both in one structure
> was doable; now I do not.
>
>> I was working on the case folding issue earlier this year, and found problem
>> after problem, bug after bug. Some of these are fixed by going to utf8;
>> some are not.
>
> I can imagine.
>
>> It's pretty clear to me that people aren't using Perl for serious Unicode
>> work with case folding, or there would be a lot more bug reports on it than
>> there are. Before I got distracted by fixing mktables, I was coming to the
>> idea that the current scheme of things just might not ever fully work for
>> code points that have multiple character folds--that possibility just never
>> was planned for in the original algorithm.
>
> Maybe I misunderstand you, but....
>
> Both streams are supposed to be foldcased as a normalizer and then
> compared. The trie logic handles this properly, as far as I know. The
> matching algorithm was determined by the Unicode folks, so I don't see
> why it shouldn't work.
>
> Or do you mean something else? Our character classes? In which case I
> can completely see your point. And Jarkko long ago suggested to me
> that we should put effort into rewriting the char-class code.

I can't remember all the details now, and I need to get into it again to
reconstruct them. I should have submitted a bug report; I hope I've learned
my lesson. The part I remember is about char classes, and maybe that is the
whole thing. I started writing code around it.

One issue is that almost half the letters of the ASCII alphabet in Unicode
5.1 are whole or parts of folded utf8 characters. E.g., f i is the fold for
the ligature ﬁ (U+FB01); k is a fold for the Kelvin sign (U+212A), etc.
When these are in char classes, they can get optimized out (I don't remember
the details right now, but I have code that does) so that they just don't
exist when a utf8 string comes along to be matched.

>
> Cheers,
> Yves
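
For readers who want to see the folds discussed in this thread, here is a minimal sketch in Python rather than Perl. It does not exercise Perl's regex engine or its tries. It only uses Python's built-in `str.casefold()`, which applies Unicode full case folding, to show the "fold both sides, then compare" idea and to list which ASCII letters turn up in the fold of some non-ASCII code point. The helper names `caseless_equal` and `ascii_letters_in_folds` are made up for this illustration. The letters reported depend on the Unicode version bundled with the Python interpreter, which is newer than 5.1, so the result may not match the count mentioned above exactly.

```python
import string


def caseless_equal(a, b):
    """Fold both strings, then compare them (no normalization step)."""
    return a.casefold() == b.casefold()


def ascii_letters_in_folds():
    """Return the ASCII lowercase letters that appear in the case fold
    of at least one non-ASCII code point."""
    found = set()
    for cp in range(0x80, 0x110000):
        ch = chr(cp)
        folded = ch.casefold()
        if folded != ch:
            # Keep only the ASCII letters that occur in this fold.
            found.update(c for c in folded if c in string.ascii_lowercase)
    return found


if __name__ == "__main__":
    print(caseless_equal("\xdf", "SS"))    # sharp s against "SS"
    print(caseless_equal("\ufb01", "FI"))  # fi ligature against "FI"
    print(caseless_equal("\u212a", "k"))   # Kelvin sign against "k"
    print(sorted(ascii_letters_in_folds()))
```

A character class that lists only `k` or `s` has to keep such code points in mind, which is the difficulty with optimizing those letters away before a utf8 string arrives.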