perl.perl5.porters

Interesting regex engine behaviour

From: Marcus Holland-Moritz
Date: January 23, 2005 13:12
Subject: Interesting regex engine behaviour
Message ID: 20050123221203.15e731e1@r2d2

I just noticed the following strange regex "feature":

  #!/usr/bin/perl
  use Devel::Peek;
  use Convert::Binary::C;
  use re 'debug';
  
  $str = "\x42";
  $rv = Convert::Binary::C->new->pack('char', 0x42);
  
  print "[$rv][$str]\n";
  
  Dump $rv;
  Dump $str;
  
  $rv eq $str or warn "not equal\n";
  $rv =~ /^$str$/ or warn "no match\n";

Since both tests compare the same two strings, this should warn
either twice or not at all. However, it warns only once:

  mhx@r2d2 $ perl /tmp/test.pl 
  [B][B]
  SV = PV(0x814b518) at 0x8165718
    REFCNT = 1
    FLAGS = (POK,pPOK)
    PV = 0x815e718 "B"
    CUR = 1
    LEN = 2
  SV = PV(0x814b4b8) at 0x815a51c
    REFCNT = 1
    FLAGS = (POK,pPOK)
    PV = 0x8151e88 "B"\0
    CUR = 1
    LEN = 2
  Compiling REx `^B$'
  size 5 Got 44 bytes for offset annotations.
  first at 2
     1: BOL(2)
     2: EXACT <B>(4)
     4: EOL(5)
     5: END(0)
  anchored `B'$ at 0 (checking anchored) anchored(BOL) minlen 1 
  Offsets: [5]
          1[1] 2[1] 0[0] 3[1] 4[0] 
  Guessing start of match, REx `^B$' against `B'...
  Guessed: match at offset 0
  Matching REx `^B$' against `B'
    Setting an EVAL scope, savestack=3
     0 <> <B>               |  1:  BOL
     0 <> <B>               |  2:  EXACT <B>
     1 <B> <>               |  4:  EOL
                              failed...
  Match failed
  no match
  Freeing REx: `"^B$"'

The reason appears to be that the scalar returned by
Convert::Binary::C->pack() isn't \0-terminated (note the missing
\0 after "B" in the first Dump above). This has never been a
problem until I coded a regex similar to the above in one
of my test scripts.

It seems that, for reasons I don't know, the regex engine
requires the string being matched to be \0-terminated.
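
To illustrate the difference between the two scalars in the Dump output, here is a rough model in plain C. It is only a sketch: counted_str, make_terminated(), make_unterminated() and byte_after_end() are made-up names and not part of the Perl API, and the filler byte 'X' merely stands in for whatever happens to follow the string data in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a length-counted string, loosely mirroring the
 * PV/CUR/LEN fields printed by Devel::Peek. This is not Perl API. */
typedef struct {
    char   pv[8];  /* storage for the string data */
    size_t cur;    /* logical string length (CUR) */
    size_t len;    /* bytes considered allocated (LEN) */
} counted_str;

/* Copy n bytes and write a trailing \0, as most producers do.
 * Assumes n + 1 <= sizeof pv. */
static counted_str make_terminated(const char *src, size_t n)
{
    counted_str s;
    memset(s.pv, 'X', sizeof s.pv);  /* simulate leftover memory contents */
    memcpy(s.pv, src, n);
    s.pv[n] = '\0';
    s.cur = n;
    s.len = n + 1;
    return s;
}

/* Copy n bytes but leave the byte after the data untouched.
 * Assumes n + 1 <= sizeof pv. */
static counted_str make_unterminated(const char *src, size_t n)
{
    counted_str s;
    memset(s.pv, 'X', sizeof s.pv);
    memcpy(s.pv, src, n);
    s.cur = n;
    s.len = n + 1;
    return s;
}

/* The byte just past the logical end of the string. */
static unsigned char byte_after_end(const counted_str *s)
{
    return (unsigned char)s->pv[s->cur];
}

/* Both strings are equal up to CUR, yet differ in the byte after it. */
void counted_str_demo(void)
{
    counted_str a = make_terminated("B", 1);
    counted_str b = make_unterminated("B", 1);

    assert(a.cur == b.cur && memcmp(a.pv, b.pv, a.cur) == 0);
    assert(byte_after_end(&a) == '\0');
    assert(byte_after_end(&b) == 'X');
}
```

A comparison that honours the length (like eq) sees two identical strings, while any code that peeks at the byte after the end sees a difference.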

The code in question appears to be this snippet from regexec.c (line 2407):

    case SEOL:
      seol:
        if ((nextchr || locinput < PL_regeol) && nextchr != '\n')
            sayNO;
        if (PL_regeol - locinput > 1)
            sayNO;
        break;

I wonder what case the first 'nextchr' check is supposed to handle.
I've removed that check (and done the same for the MEOL case):

    case MEOL:
        if (locinput < PL_regeol && nextchr != '\n')
            sayNO;
        break;
    case SEOL:
      seol:
        if (locinput < PL_regeol && nextchr != '\n')
            sayNO;
        if (PL_regeol - locinput > 1)
            sayNO;
        break;
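
The effect of dropping the 'nextchr' check can be reproduced outside the engine. The following is a simplified stand-alone model, not the real regexec.c code: seol_original() and seol_patched() are hypothetical names, sayNO is modelled as returning false, and it assumes (as the snippet suggests) that nextchr holds the byte at locinput even when locinput equals PL_regeol.

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the original SEOL test. The caller must guarantee that one
 * byte past the logical end (regeol) is readable. */
static bool seol_original(const char *locinput, const char *regeol)
{
    /* Assumption: the byte at locinput is read even at regeol. */
    unsigned char nextchr = (unsigned char)*locinput;

    if ((nextchr || locinput < regeol) && nextchr != '\n')
        return false;               /* sayNO */
    if (regeol - locinput > 1)
        return false;               /* sayNO */
    return true;
}

/* Model of the patched SEOL test: nextchr only matters before regeol. */
static bool seol_patched(const char *locinput, const char *regeol)
{
    unsigned char nextchr = (unsigned char)*locinput;

    if (locinput < regeol && nextchr != '\n')
        return false;               /* sayNO */
    if (regeol - locinput > 1)
        return false;               /* sayNO */
    return true;
}

void seol_demo(void)
{
    const char terminated[2]   = { 'B', '\0' };       /* "B" plus \0 */
    const char unterminated[2] = { 'B', 'X' };        /* "B" plus stray byte */
    const char with_newline[3] = { 'B', '\n', '\0' }; /* "B\n" plus \0 */

    /* Position after 'B' in a \0-terminated "B": both variants accept. */
    assert(seol_original(terminated + 1, terminated + 1));
    assert(seol_patched(terminated + 1, terminated + 1));

    /* Same position, but no \0 after the data: only the patch accepts. */
    assert(!seol_original(unterminated + 1, unterminated + 1));
    assert(seol_patched(unterminated + 1, unterminated + 1));

    /* A single trailing newline before the end is accepted by both. */
    assert(seol_original(with_newline + 1, with_newline + 2));
    assert(seol_patched(with_newline + 1, with_newline + 2));
}
```

In this model the two variants only disagree when locinput has reached the end of the string and the byte stored there is neither \0 nor a newline, which is the situation shown in the first debug trace.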

All tests still pass, and with the patched bleadperl the test
script now matches:

  mhx@r2d2 $ bleadperl /tmp/test.pl 
  [B][B]
  SV = PV(0x8161560) at 0x81c7ee4
    REFCNT = 1
    FLAGS = (POK,pPOK)
    PV = 0x81fefe8 "B"
    CUR = 1
    LEN = 2
  SV = PV(0x81614a0) at 0x816fb20
    REFCNT = 1
    FLAGS = (POK,pPOK)
    PV = 0x81659a0 "B"\0
    CUR = 1
    LEN = 2
  Compiling REx `^B$'
  size 5 Got 44 bytes for offset annotations.
  first at 2
     1: BOL(2)
     2: EXACT <B>(4)
     4: EOL(5)
     5: END(0)
  anchored `B'$ at 0 (checking anchored) anchored(BOL) minlen 1 
  Offsets: [5]
          1[1] 2[1] 0[0] 3[1] 4[0] 
  Guessing start of match, REx `^B$' against `B'...
  Guessed: match at offset 0
  Matching REx `^B$' against `B'
    Setting an EVAL scope, savestack=3
     0 <> <B>               |  1:  BOL
     0 <> <B>               |  2:  EXACT <B>
     1 <B> <>               |  4:  EOL
     1 <B> <>               |  5:  END
  Match successful!
  Freeing REx: `"^B$"'

However, I have no idea whether this change would break
anything else. :-)

Perhaps someone with more knowledge of the regex engine
internals could comment on this...

Marcus
 
-- 
Those who do not understand Unix are condemned to reinvent it, poorly.
		-- Henry Spencer
