On 04/28/2011 05:27 PM, Karl Williamson wrote:
> On 04/28/2011 02:59 PM, Tom Christiansen wrote:
>> It's even weirder than that. Given:
>>
>> $\ = "\n";
>> my $x = "X-Xoqp-SDR-FpCqar4-Duooery-Faad-laeC_cCesspfpads:";
>> print $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/ || 0;
>> print $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/i || 0;
>> utf8::upgrade($x);
>> print $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/ || 0;
>> print $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/i || 0;
>>
>> Here are the results:
>>
>> % perl5.12.3 /tmp/bl
>> 1
>> 1
>> 1
>> 1
>>
>> % perl5.12.3 -M5.012 /tmp/bl
>> 1
>> 1
>> 1
>> 1
>>
>> % blead /tmp/bl
>> 1
>> 1
>> 1
>> 0
>>
>> % blead -M5.012 /tmp/bl
>> 1
>> 0
>> 1
>> 0
>>
>> So with (full) Unicode strings, it's yet again different still.
>>
>> With apologies to Philip K Dick :), this is a Karl-Thing, I think.
>>
>> --tom
>>
>
> Fortunately for my ego, the problem isn't in my code.
>
> Unfortunately for the project (and perhaps for my ego), the problem is
> much deeper; it is an issue with multi-character folds. The reason this
> doesn't match in 5.14 when full Unicode semantics is on (with or without
> utf8ness) is that in 5.14 for the first time, multi-char folds work.
>
> At least they work as designed. Perhaps there is a better design that
> wouldn't have the gotcha this gives. I don't know, and am open to
> suggestions. But Unicode is proposing to stop recommending that regular
> expression engines accept them. This proposal stemmed, at least in
> part, from my pointing out issues to them about the feature. But their
> proposal wasn't mounted until around the feature freeze time of 5.14,
> after I had coded to what I thought were the correct specifications; and
> the comment period for the proposal is still going -- it ends this
> weekend.
> If you'd like to comment, see the document at
> http://unicode.org/reports/tr18/proposed.html
> After comments are over, they have to be evaluated, and will be
> presented to May's meeting of the Unicode Technical Committee, and who
> knows what will happen then.
>
> Let me quote from part of the motivation for the changes, found at
> http://unicode.org/review/pri179
>
> "There are a number of examples where the results would be
> counter-intuitive for typical users of regular expressions."
>
> I think by "typical users" they mean anyone who doesn't have the mind of
> a CPU. :)
>
> Anyway, what's going on here is that the regex appears to have been
> designed to match the graphic ASCII characters except the colon. But it
> is written so as to match the complement of the complement of those
> characters, with case-insensitivity thrown in. That means it is supposed
> to not match our old friend the German sharp ss, "ß". But that means it
> is not supposed to match the case fold of ß because we have /i matching.
> And that means it isn't supposed to match the sequence 'ss', which is
> the case fold of ß. And that means the match fails at the point in the
> above string where there is 'ss' in a row.
>
> That is counter-intuitive to me, but it is correct with the implemented
> regex rules, and it seems to me to be correct according to what the
> current Unicode TR18 says. Is there disagreement?
>
> What to do? I think this is a 5.14 blocker. And I'm thankful George
> found it now and not later. I wish I had a really good idea of how to
> proceed. My proposal, unless a better idea surfaces, is to just disable
> multi-character folds in regex matching for 5.14, which is the direction
> that Unicode appears to be moving. Multi-character folding worked
> somewhat in earlier releases, but was extremely buggy, and could not be
> relied on. Thus there are some backward compatibility issues, but we
> might have to do that anyway if Unicode proceeds as expected.
> It's actually quite easy to change the code to do this, as almost
> everything is driven off a mktables generated table. We just have to
> change a line or two in mktables to ignore the multi-char folds. There's
> also a line or two in regcomp, as the fold for ß has a special case
> there (for performance, to avoid having to look at the tables for
> non-utf8 patterns). There's plenty of code that just won't get
> exercised, which could be #ifdef'd out, but that's not really necessary.
> The bigger amount of work is fixing the .t's. And fixing the .pod's.
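
For code that has to run while the handling of multi-character folds is still unsettled, a pattern can be written so that it never involves ß at all. The following is only a sketch of two possible rewrites of the quoted pattern, not something proposed in the thread. The variable names are made up for illustration, and the assumption that the pattern is meant to accept exactly the bytes 0x21-0x7E other than the colon comes from the description quoted above.

```perl
use strict;
use warnings;

# Same test string as in the quoted script.
my $x = "X-Xoqp-SDR-FpCqar4-Duooery-Faad-laeC_cCesspfpads:";

# Rewrite 1 (assumption): list the wanted characters directly,
# 0x21-0x7E minus the colon (0x3A), instead of negating the unwanted
# ones. The class never mentions \xDF, so its fold 'ss' is not excluded.
my $positive = qr/^[\x21-\x39\x3b-\x7e]+:/i;

# Rewrite 2 (assumption): keep the negated class but drop /i. Without
# case folding, excluding \xDF cannot also exclude the sequence 'ss'.
my $no_fold = qr/^[^\x00-\x1f\x7f-\xff :]+:/;

my $positive_ok = $x =~ $positive ? 1 : 0;
my $no_fold_ok  = $x =~ $no_fold  ? 1 : 0;

print "$positive_ok\n$no_fold_ok\n";
```

The two rewrites are not equivalent in general. On a string containing characters above 0xFF, the negated class still accepts them while the positive class rejects them, so the choice depends on what the pattern is really meant to allow.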