Tom Christiansen wrote:
> Karl,
>
> Thanks very much for the patch.
>
> I confess guess I never *expected* "\400" to be "\x{100}",
> but rather "\x{20}0".

I'm not wedded to the interpretation I submitted. It came out of testing
against the documentation that a backslash in a regular expression can be
followed by 1-3 octal digits (the first one not needing to be a 0).

And the interpretation I submitted is the one that the pre-existing code
meant. Witness

    perl -le 'print "\x{100}\400" =~ /\x{100}\400/'
    perl -le 'print "\400\x{100}" =~ /\400\x{100}/'

both print 1. That is, if any other character anywhere in the pattern
causes regcomp to think that it should store the re as utf8, then \400
matches \400 and so on up to the maximum 3 digit octal: \777 matches \777.
Failing that, /\400/ will match \01\00. It never matches "\x{20}0".

So we have an existing bug. Sometimes \400 matches \400, and sometimes it
matches \01\00, depending on what I would call spooky action at a distance.
(This means that \777 sometimes already matches \777 now.) I'm trying to
get rid of these inconsistencies.

I think something should be done here, but perhaps it's not what I thought
it should be. My patch follows what the code was intending to do, but
perhaps we should change that intention. Please guide me.

>
> However, I'm a bit concerned about perl -0777, as it's documented to
>
>     The special value 00 will cause Perl to slurp files in paragraph
>     mode. The value 0777 will cause Perl to slurp files whole because
>     there is no legal byte with that value.
>     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
>     Either put all the switches after the 32-character boundary
>     (if applicable), or replace the use of B<-0>I<digits> by
>     C<BEGIN{ $/ = "\0digits"; }>.
>                    ^^^^^^^^
>
> While your patch doesn't seem to affect that, it might lead one to
> question differing behaviors in different places.
>
>     regexp: \777 means naughty
>     string: \777 means... um what?
>     CLI: \777 means undef $/
>
> And so I wonder what's to be done about that.
>
> But perhaps I under-understand?
>
> --tom
>
> --- regcomp.c.orig 2008-10-18 12:16:42.000000000 -0600
> +++ regcomp.c 2008-10-24 10:22:24.000000000 -0600
> @@ -7417,6 +7417,7 @@
>  I32 flags = 0;
>  STRLEN numlen = 3;
>  ender = grok_oct(p, &numlen, &flags, NULL);
> + if (ender > 0xff) RExC_utf8 = 1;
>  p += numlen;
>  }
>  else {
> --- t/op/re_tests.orig 2008-09-22 14:42:42.000000000 -0600
> +++ t/op/re_tests 2008-10-24 10:51:35.000000000 -0600
> @@ -1357,3 +1357,8 @@
>  /^\s*i.*?o\s*$/s io\n io y - -
>  # As reported in #59168 by Father Chrysostomos:
>  /(.*?)a(?!(a+)b\2c)/ baaabaac y $&-$1 baa-ba
> +
> +# #59342
> +/\377/ \377 y $& \377
> +/\400/ \400 y $& \400
> +/\777/ \777 y $& \777
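
For anyone following along without the Perl source to hand, here is a rough, standalone sketch of the check the one-line patch adds. It is not Perl's code: parse_octal_escape() and needs_utf8() are hypothetical names made up for illustration, standing in for the grok_oct() call and the "ender > 0xff" test in the hunk above. It assumes the escape is at most three octal digits, as in the quoted hunk (numlen = 3).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the grok_oct() call in the patch, not the
 * real function: reads at most three octal digits from s, stores the
 * number of digits consumed in *len, and returns their value. */
static unsigned parse_octal_escape(const char *s, size_t *len)
{
    unsigned value = 0;
    size_t n = 0;

    assert(s != NULL && len != NULL);

    /* Stop at the first non-octal character or after three digits. */
    while (n < 3 && s[n] >= '0' && s[n] <= '7') {
        value = value * 8 + (unsigned)(s[n] - '0');
        n++;
    }
    *len = n;
    return value;
}

/* Hypothetical helper mirroring the patch's "ender > 0xff" test:
 * returns 1 when the value cannot be stored in a single byte. */
static int needs_utf8(unsigned value)
{
    return value > 0xff;
}
```

With this sketch, the digits "377" give 255 and stay in the single-byte range, while "400" gives 256 and "777" gives 511, both of which would trip the flag. That is the boundary the three new re_tests entries are probing.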