
Re: [perl #3634] Capture corruption through self-modifying regexp(?{...})

From: Dave Mitchell
Date: July 28, 2013 00:21
Subject: Re: [perl #3634] Capture corruption through self-modifying regexp(?{...})
Message ID: 20130728002123.GN2177@iabyn.com

On Sat, Jul 27, 2013 at 07:05:39AM -0700, Father Chrysostomos via RT wrote:
> On Thu Jun 14 15:13:18 2012, davem wrote:
> > On Thu, Jun 14, 2012 at 09:48:34AM -0700, Father Chrysostomos via RT wrote:
> > > On Thu Aug 03 18:02:21 2000, jfriedl@yahoo-inc.com wrote:
> > > > 
> > > >     #!/usr/local/bin/perl -w
> > > >     use strict;
> > > > 
> > > >     my $text = "a";
> > > >     $text =~ m/(.(?{ $text .= "x" }))*/;
> > > > 
> > > >     print "text is [$text]\n";
> > > >     print "length of text: ", length($text), "\n";
> > > >     print "starts: ", join('|', @-), "\n";
> > > >     print "ends  : ", join('|', @-), "\n";
> > > >     printf("length of match parts: [%d|%d|%d]\n", length($`), length($&), length($'));
> > > >     printf("match itself: [%s|%s|%s]\n", map { defined($_) ? $_ : 'X'} $`, $&, $');
> > > >     print "\$1[$1]\n";
> > > > 
> > > > prints (when piped through cat -v):
> > > > 
> > > >     text is [axxxxxxxxx]
> > > >     length of text: 10
> > > >     starts: 0|7
> > > >     ends  : 0|7
> > > >     length of match parts: [0|8|0]
> > > >     match itself: [|a^@^X@M-hd^O^H|X]
> > > >     $1[^H]
> > > 
> > > This is still a problem in bleadperl (c8d84f8c67a), even after Dave
> > > Mitchell’s jumbo re-eval rewrite.
> > 
> > Yep, that's the one ticket in the metaticket that's not fixed yet.
> 
> This appears to be fixed now, and I suspect it is because of
> PERL_NEW_COPY_ON_WRITE (meaning the bug is still present under
> -Accflags=-DPERL_NO_COW), but I haven’t checked.

The assertion failures stop with the following commit, according to
bisect, although I haven't looked closely to decide whether this
is actually the complete fix or whether anything still needs addressing.

commit 7016d6ebb4afd4eb7b71b00f15b7515b5e45fee8
Author: David Mitchell <davem@iabyn.com>
Date:   Fri Sep 21 10:29:04 2012 +0100

    stop regex engine reading beyond end of string
    
    Historically the regex engine has assumed that any string passed to it
    will have a trailing null char. This isn't normally an issue in perl code,
    since perl strings *are* null terminated; but it could cause problems with
    strings returned by XS code, or with someone calling the regex engine
    directly from XS, with strend not pointing at a null char.
    
    The engine currently relies on there being a null char in the following
    ways.
    
    First, when at the end of string, the main loop of regmatch() still reads
    in the 'next' character (i.e. the character following the end of string)
    even if it doesn't make any use of it. This precludes using memory mapped
    files as strings for example, since the read off the end would SEGV.
    
    Second, the matching algorithm often required the trailing character to be
    \0 to work correctly: the test for 'EOF' was "if next char is null *and*
    locinput >= PL_regeol, then stop". So a random non-null trailing char
    could cause an overshoot.
    
    Third, some match ops require the trailing char to be null to operate
    correctly; for example, \b applied at the end of the string only happens
    to work because the trailing char (\0) happens to match \W.
    
    Also, some utf8 ops will try to extract the code point at the end, which
    can result in multiple bytes past the end of string being read, and
    possible problems if they don't correspond to well-formed utf8.
    
    The main fix is in S_regmatch, where the 'read next char' code has been
    updated to set it to a special value, NEXTCHR_EOS instead, if we would be
    reading past the end of the string.
    
    Lots of other random bits in the regex engine needed to be fixed up too.
    
    To track these down, I temporarily hacked regexec_flags() to make a copy
    of the string but without trailing \0, then ran all the t/re/*.t tests
    under valgrind to flush out all buffer overruns. So I think I've removed
    most of the bad code, but by no means all of it. The code within the
    various functions in regexec.c is far too complex to be able to visually
    audit the code with any confidence.
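
To make the commit message a bit more concrete, here is a rough C sketch of the idea it describes: return a sentinel instead of reading the byte at strend, and test against a copy of the string that has no trailing \0. This is not the code from regexec.c. Every name below (the sketch_ functions, SKETCH_NEXTCHR_EOS and its value) is made up for illustration, and it only handles single-byte characters.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sentinel: any value outside the 0..255 byte range will do.
 * The real definition of NEXTCHR_EOS is not shown in the commit message. */
#define SKETCH_NEXTCHR_EOS (-10)

/* Return the byte at locinput, or the sentinel when locinput has reached
 * strend. The byte at strend itself is never dereferenced. */
static int sketch_next_char(const char *locinput, const char *strend)
{
    assert(locinput <= strend);
    if (locinput >= strend)
        return SKETCH_NEXTCHR_EOS;
    return (unsigned char)*locinput;
}

/* ASCII-only word-character test. The sentinel is rejected explicitly
 * instead of relying on a trailing '\0' happening to be a non-word char. */
static int sketch_is_word(int c)
{
    if (c == SKETCH_NEXTCHR_EOS)
        return 0;
    return c == '_'
        || (c >= '0' && c <= '9')
        || (c >= 'a' && c <= 'z')
        || (c >= 'A' && c <= 'Z');
}

/* \b at pos: true when exactly one neighbour is a word character. */
static int sketch_at_word_boundary(const char *start, const char *pos,
                                   const char *strend)
{
    int before = (pos > start) ? sketch_is_word((unsigned char)pos[-1]) : 0;
    int after  = sketch_is_word(sketch_next_char(pos, strend));
    return before != after;
}

/* Walk the string using only the strend bound as the stop condition. */
static size_t sketch_count_chars(const char *start, const char *strend)
{
    size_t n = 0;
    const char *p = start;
    while (sketch_next_char(p, strend) != SKETCH_NEXTCHR_EOS) {
        n++;
        p++;
    }
    return n;
}

/* Copy len bytes into an exact-size heap buffer with no trailing NUL, so
 * that a tool such as valgrind reports any read at or past the end.
 * Returns NULL on allocation failure; the caller frees the result. */
static char *sketch_copy_unterminated(const char *s, size_t len)
{
    char *copy = malloc(len ? len : 1);
    if (copy && len)
        memcpy(copy, s, len);
    return copy;
}
```

Running code like this under valgrind against a buffer from sketch_copy_unterminated() is the same trick as the temporary regexec_flags() hack described above: with no \0 after the last byte, any loop that still peeks at the 'next' character shows up as an invalid read rather than silently working.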


-- 
You live and learn (although usually you just live).
