2009/12/10 karl williamson <public@khwilliamson.com>:
> I can't remember all the details now; and need to get into it again to
> reconstruct it. I should have submitted a bug report. I hope I've learned
> my lesson.
>
> The part I remember is about char classes, and maybe that is the whole
> thing. I started writing code around it. One issue is that almost half the
> letters of the ASCII alphabet in 5.1 are whole or parts of folded utf8
> characters. E.g., f i is the fold for the ligature fi; k is a fold for the
> Kelvin symbol, etc. When these are in char classes, they can get optimized
> out (I don't remember the details right now, but I have code that does) so
> that they just don't exist when a utf8 string comes along to be matched.

Right, charclasses being broken in some Unicode contexts does not surprise
me. The CC algorithm is not designed to handle multi-codepoint folding, and
is quite inefficient when operating on Unicode charclasses.

An AWESOME project for someone with tuits and an interest would be to
implement one of the other data structures for charclasses, like skiplists,
and possibly use the trie for tricky folding scenarios.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
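
For readers unfamiliar with the two folds mentioned in the quoted text, here is a minimal sketch of why a class that stores only single code points struggles with multi-codepoint folding. It is written in Python purely as an illustration and uses str.casefold() as a stand-in for Unicode full case folding; it is not how Perl's regex engine represents charclasses. The helper naive_class_contains and the ascii_letters set are hypothetical names invented for this example.

```python
# Illustration only: Python's str.casefold() stands in for Unicode full
# case folding. This is NOT Perl's charclass implementation.

KELVIN = "\u212a"       # KELVIN SIGN
FI_LIGATURE = "\ufb01"  # LATIN SMALL LIGATURE FI


def naive_class_contains(char_class, ch):
    """Hypothetical single-code-point class test.

    Folds one input character and looks the result up in a set whose
    members are all single code points.
    """
    return ch.casefold() in char_class


# A class holding the 26 lowercase ASCII letters, one code point each.
ascii_letters = set("abcdefghijklmnopqrstuvwxyz")

kelvin_fold = KELVIN.casefold()    # folds to a single code point
fi_fold = FI_LIGATURE.casefold()   # folds to a two-code-point sequence

# The Kelvin sign folds to one ASCII letter, so a per-code-point lookup works.
kelvin_in_class = naive_class_contains(ascii_letters, KELVIN)

# The ligature folds to two code points, which no single member of the set
# can equal, so the naive lookup misses it.
fi_in_class = naive_class_contains(ascii_letters, FI_LIGATURE)

print(repr(kelvin_fold), len(kelvin_fold), kelvin_in_class)
print(repr(fi_fold), len(fi_fold), fi_in_class)
```

The second case is the one that needs a structure able to match a sequence rather than a single member, which is presumably where something like the trie suggested above would come in.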