develooper Front page | perl.perl5.porters | Postings from February 2015

Re: [perl #123918] regex end of line match very slow

Thread Previous | Thread Next
From:
Dave Mitchell
Date:
February 25, 2015 14:37
Subject:
Re: [perl #123918] regex end of line match very slow
Message ID:
20150225143650.GI28599@iabyn.com
On Tue, Feb 24, 2015 at 11:26:31AM -0800, Paul Salazar wrote:
> Running the following regex matching code is very slow when matching
> long strings for a positive match, but very fast for a negative match.
> 
> use Time::HiRes qw(gettimeofday);
> 
> my $iters = 100000;
> my $x; my $strt = 0;
> 
> my $strng = ''; my $dat = '123';
> my $reg = "\\s\$";
> 
> print "Running match $reg on $iters iterations on concatenated string...\n";
> 
> $strt = gettimeofday();
> for ($x = 0; $x < $iters; $x++) {
>     $strng .= ($strng =~ m/$reg/o ? ' ' : '') . $dat . ' ';
> }
> print "Time: ".(gettimeofday() - $strt)."\n";
> 
> $strng = ''; $dat = '123';
> $reg = "\\S\$";
> 
> print "\nRunning match $reg on $iters iterations on concatenated string...\n";
> 
> $strt = gettimeofday();
> for ($x = 0; $x < $iters; $x++) {
>     $strng .= ($strng =~ m/$reg/o ? ' ' : '') . $dat . ' ';
> }
> print "Time: ".(gettimeofday() - $strt)."\n";
> 
> 
> The output is as follows:
> 
> 
> Running match \s$ on 100000 iterations on concatenated string...
> Time: 6.53281188011169
> 
> Running match \S$ on 100000 iterations on concatenated string...
> Time: 0.038999080657959
> 
> 
> This problem does not exist in perl 5.12. I'm not sure when it was
> introduced into core library. I discovered it when an app using
> PDF::API2 starting going really slow after upgrading to 5.20 strawberry.

It's a side-effect of copy-on-write (COW) strings, which first appeared in
5.20.0. Traditionally in the presence of captures or $`/$&/$', a
successful match would make a copy of the just-matched string. Without
their presence, the copy would be skipped, which unfortunately made things
like

    $s =~ /abc/;
    .... modify $s ...;
    $x = eval '$&';

leave $x containing garbage.

With COW, a successfully matched string is marked as COW, with the regex
holding a second pointer to the string.  If the string is subsequently
modified, the string is copied (hence the C in COW). Given that the
string is being repeatedly extended in small increments, physically coping
the whole string each time round the loop ends up going quadratic.

Conversely in the days before COW, something like 

    $s = 'x' x 1_000_000;
    $&;
    1 while $s =~ /./g;

would go quadratic on length.

So pre-COW:

    the regex engine *always* copied after a successful match in the presence
    of $& etc, going quadratic always, regardless of whether the string is
    subsequently modified or not; (it actually didn't copy in the just
    presence of captures, and just returned garbage in $1 etc if the
    string was modified; this was a tradeoff of performance over
    correctness).

Post-COW:

    on a successful match the string is marked as COW, but no physical
    copy is done. $1, $& etc will always be correct regardless of whether
    the string is subsequently modified, and regardless of whether $& had
    been seen prior to the regex being executed. However, if the string
    is subsequently modified, a full copy is done.

I can't yet thing of any elegant solution for this.

Some theoretical workarounds that don't actually work are:

* 5.21.8 and later include the /n modifier, which doesn't set $1 etc,
but it appears to be buggy in that it seems to still make the string COW.

* Perform another successful match on a different string after the first
match but before modifying the string, e.g.

    $long_string =~ /.../;
    "a" =~ /./;
    modify $long_string;

In theory that changes the "current regex" (PL_curpm) which is used
to calculate the values of $1 etc on the fly. Unfortunately when it is
changed, the old regex still holds a pointer to the COWed string;
perhaps we should look into dropping a COW reference each time PL_curpm
is updated????


-- 
Dave's first rule of Opera:
If something needs saying, say it: don't warble it.

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About