On 4/24/07, demerphq <demerphq@gmail.com> wrote:
> The problem is that the optimiser thinks that /\xDF/i under unicode is
> really 'ss' and therefore that the minimum length string that can
> match is 2. Which obviously causes problems matching a latin-1 \xDF,
> which is only one byte. Amusingly, another bug in the regex engine
> allows this to work out ok when the string is unicode. utf8 \xDF is
> two bytes long, and the regex engine has some issues with the
> distinction between "byte length" and "codepoint length", so it sees
> the two bytes of the single codepoint as being sufficient length, and
> then uses unicode folding to convert the string's \xDF to 'ss' and
> everything works out. But this is a fluke; I'm positive that there are
> other fold case scenarios where we can't rely on this bug saving the
> day. If the fold case version was longer (in bytes) than the utf8
> version of the original it would not work out.
[...]
> At this point the only solution I can think of is to disable minlen
> checks when a character is encountered that folds to a multi-character
> string.

Well, I have a better solution, it looks like. I've created a new regop,
FOLDCHAR, that will be used to handle the three problematic codepoints
properly. This way the regex engine doesn't see them as normal text, and
therefore the optimiser can do the right thing and everything works out
properly.

Sigh, so much trouble for one character. (The other two are just bonus
material.)

It's actually possible to detect codepoints that will have this problem,
so it's probably smart to put something in mktables that will detect and
warn if any new ones come up. Or we can just do it by hand when updating
the unicode data files.

Patch is attached.

cheers,
Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
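
For anyone who wants to see the length mismatch and the "detect codepoints that fold to a multi-character string" idea in concrete terms, here is a rough sketch. It is not the patch and not mktables code: it is written in Python only because str.casefold() applies Unicode full case folding, and the helper names fold_info and multi_char_folds are made up for illustration. Which codepoints the Perl regex engine actually treats specially is not something this sketch decides.

```python
# Illustrative sketch only. Python's str.casefold() stands in for the
# fold tables Perl builds from the Unicode data files; fold_info and
# multi_char_folds are hypothetical helper names.

def fold_info(ch):
    """Return length data for one character and its full case fold."""
    folded = ch.casefold()
    return {
        "folded": folded,
        "fold_chars": len(folded),                       # codepoint length of the fold
        "utf8_bytes": len(ch.encode("utf-8")),           # byte length of the original
        "fold_utf8_bytes": len(folded.encode("utf-8")),  # byte length of the fold
    }


def multi_char_folds(limit=0x110000):
    """List code points below `limit` whose full case fold is longer than one character."""
    result = []
    for cp in range(limit):
        if 0xD800 <= cp <= 0xDFFF:  # surrogates are not real characters
            continue
        if len(chr(cp).casefold()) > 1:
            result.append(cp)
    return result


if __name__ == "__main__":
    # \xDF: one byte in latin-1, two bytes in utf8, folds to two characters.
    print(len("\xdf".encode("latin-1")), fold_info("\xdf"))
    # A case where the fold is longer in bytes than the utf8 original.
    print(fold_info("\u0149"))
    # The kind of scan a build-time check could run.
    print([hex(cp) for cp in multi_char_folds(0x250)])
```

A build-time check along these lines would compare the scan result against a known list and warn when a new Unicode release adds an entry.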