
Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400

From: karl williamson
Date: October 25, 2008 07:32
Subject: Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400
Message ID: 49032DF6.6070507@khwilliamson.com

Tom Christiansen wrote:
> Karl,
> 
> Thanks very much for the patch.
> 
> I confess I never *expected* "\400" to be "\x{100}", 
> but rather "\x{20}0".
I'm not wedded to the interpretation I submitted.  It came out of testing 
against the documentation, which says that a backslash in a regular 
expression can be followed by 1-3 octal digits (the first one not needing 
to be a 0).  And the interpretation I submitted is the one that the 
pre-existing code intended.  Witness

  perl -le 'print "\x{100}\400" =~ /\x{100}\400/'
  perl -le 'print "\400\x{100}" =~ /\400\x{100}/'
both print 1.

That is, if any other character anywhere in the pattern causes regcomp 
to think that it should store the re as utf8, then \400 matches \400 and 
so on up to the maximum 3 digit octal: \777 matches \777.  Failing that, 
/\400/ will match \01\00.  It never matches "\x{20}0".

So we have an existing bug.  Sometimes \400 matches \400, and sometimes 
it matches \01\00, depending on what I would call spooky action at a 
distance.  (This means that \777 sometimes already matches \777 now.) 
I'm trying to get rid of these inconsistencies.  I think something should 
be done here, but perhaps it's not what I thought it should be.  My patch 
follows what the code was intending to do, but perhaps we should change 
that intention.  Please guide me.
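
To make the two readings of \400 concrete, here is a minimal C sketch.  It 
is not the regcomp.c code and does not call grok_oct; the helper names 
parse_octal, escape_wide and escape_byte are made up for illustration.  
escape_wide takes up to three octal digits and flags any value above 0xff 
as needing a utf8 pattern (the reading my patch follows); escape_byte stops 
before the value would overflow a byte (the "\x{20}0" reading).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (NOT Perl's grok_oct): read at most max_digits
 * octal digits from s, report how many were consumed in *used, and
 * return their numeric value. */
static unsigned parse_octal(const char *s, size_t max_digits, size_t *used)
{
    unsigned value = 0;
    size_t n = 0;

    assert(s != NULL && used != NULL);
    while (n < max_digits && s[n] >= '0' && s[n] <= '7') {
        value = value * 8 + (unsigned)(s[n] - '0');
        n++;
    }
    *used = n;
    return value;
}

/* Reading 1: always take up to three digits.  A value above 0xff cannot
 * be a single byte, so the caller is told a wide (utf8) pattern is needed. */
static unsigned escape_wide(const char *s, size_t *used, int *needs_utf8)
{
    unsigned value = parse_octal(s, 3, used);

    assert(needs_utf8 != NULL);
    *needs_utf8 = (value > 0xff);
    return value;
}

/* Reading 2: stop consuming digits as soon as the next one would push
 * the value past 0xff; the unconsumed digit is then a literal character. */
static unsigned escape_byte(const char *s, size_t *used)
{
    unsigned value = 0;
    size_t n = 0;

    assert(s != NULL && used != NULL);
    while (n < 3 && s[n] >= '0' && s[n] <= '7') {
        unsigned next = value * 8 + (unsigned)(s[n] - '0');
        if (next > 0xff)
            break;
        value = next;
        n++;
    }
    *used = n;
    return value;
}
```

Under these assumptions, escape_wide on "400" consumes all three digits and 
yields 0x100 with the utf8 flag set, while escape_byte on the same input 
consumes only "40", yields 0x20, and leaves the trailing "0" as a literal.  
For "377" the two readings agree.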

> 
> However, I'm a bit concerned about perl -0777, as it's documented to 
> 
>     The special value 00 will cause Perl to slurp files in paragraph 
>     mode.  The value 0777 will cause Perl to slurp files whole because 
>     there is no legal byte with that value.
>     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
>     Either put all the switches after the 32-character boundary 
>     (if applicable), or replace the use of B<-0>I<digits> by 
>     C<BEGIN{ $/ = "\0digits"; }>.
>                    ^^^^^^^^
> 
> While your patch doesn't seem to affect that, it might lead one to
> question differing behaviors in different places.
> 
>     regexp: \777 means naughty
>     string: \777 means... um what?
>     CLI:    \777 means undef $/
> 
> And so I wonder what's to be done about that.  
> 
> But perhaps I under-understand?
> 
> --tom
> 
>     --- regcomp.c.orig      2008-10-18 12:16:42.000000000 -0600
>     +++ regcomp.c   2008-10-24 10:22:24.000000000 -0600
>     @@ -7417,6 +7417,7 @@
>                                   I32 flags = 0;
>                                 STRLEN numlen = 3;
>                                 ender = grok_oct(p, &numlen, &flags, NULL);
>     +                           if (ender > 0xff) RExC_utf8 = 1;
>                                 p += numlen;
>                             }
>                             else {
>     --- t/op/re_tests.orig  2008-09-22 14:42:42.000000000 -0600
>     +++ t/op/re_tests       2008-10-24 10:51:35.000000000 -0600
>     @@ -1357,3 +1357,8 @@
>       /^\s*i.*?o\s*$/s      io\n io y       -       -
>       # As reported in #59168 by Father Chrysostomos:
>       /(.*?)a(?!(a+)b\2c)/  baaabaac        y       $&-$1   baa-ba
>     +
>     +# #59342
>     +/\377/ \377    y       $&      \377
>     +/\400/ \400    y       $&      \400
>     +/\777/ \777    y       $&      \777
> 
> 

