DAVEM TPF Grant July, August 2012 report

Dave Mitchell
September 5, 2012 04:50
(This report covers two months, as I appear to have completely forgotten
to send a report last month!)

As per my grant conditions, here is a report for the July/August period.

I spent a bit of time fixing a few issues caused by my rewriting of the
/(?{})/ implementation, then started to look into the last unclosed ticket
still attached to the re_eval meta-ticket. This concerns code within
(?{}) that modifies the string being matched against, and generally causes
assertion failures or coredumps:

    my $text = "a"; $text =~ m/(.(?{ $text .= "x" }))*/;
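Until that ticket is resolved, one way to sidestep the problem is to avoid touching the matched string from inside the code block and apply the change afterwards. The following is only a sketch of that idea, not a fix for the bug itself; the $pending variable is a made-up name, and it assumes a perl where lexical variables are closed over correctly inside (?{}). Note that backtracking can cause a code block to run more often than one might expect, so a deferred change like this is only appropriate for simple patterns.

```perl
use strict;
use warnings;

my $text    = "a";
my $pending = '';    # hypothetical buffer holding the deferred change

# The code block only records the change; $text itself is left alone
# while the regex engine is still matching against it.
$text =~ m/(.(?{ $pending .= "x" }))*/;

# Apply the recorded change once the match has finished.
$text .= $pending;
print "$text\n";
```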

While trying to understand what's going on, I ended up delving into the
issue of how and when perl makes a copy of the string buffer in order to
make $1, $& etc continue to show the right value even if the string is
subsequently changed. It turns out that in some circumstances this can
have a huge performance penalty. For example, the following code takes
several minutes to run, since it mallocs and copies a 1MB buffer a million
times:

    $&;
    $_ = 'x' x 1_000_000;
    1 while /(.)/g;

If you remove the $&, it runs fast (<1s), but this is only because pp_match
has a special hack added that says "even if the pattern contains captures,
in the presence of /g don't bother copying the string buffer". So the
following prints zzz rather than aaa. And if the string buffer gets
realloced in the meantime, it could print out garbage:

    $_ = 'aaa';
    /(\w+)/g;
    $_ = 'zzz';
    print "[$1]\n";
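The reason a copy matters can be seen from Perl level: a capture is recorded as a pair of offsets into the matched string (exposed as @- and @+), not as a separate string. The sketch below, with a made-up helper name capture_via_offsets, rebuilds $1 from those offsets. If the underlying buffer changes and no copy was taken, the same offsets end up pointing at different bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first capture group by slicing the
# original string with the offsets that the match recorded.
sub capture_via_offsets {
    my ($str, $re) = @_;
    $str =~ $re or return;

    # $-[1] and $+[1] are the start and end offsets of group 1.
    return substr($str, $-[1], $+[1] - $-[1]);
}

my $got = capture_via_offsets('hello world', qr/(w\w+)/);
print "group 1 via offsets: '$got'\n";
```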

Attempts to fix this in the past have tried to implement some sort of
Copy-On-Write behaviour, but have come up against the difficulty of
making an SV always honour COW in all circumstances and/or not making the
SV itself "unusual". Also, the regex engine API itself matches against a
string buffer not an SV, so you aren't guaranteed to always have a valid
SV to mess with.

My approach to this has been to only copy the substring of the string
buffer needed to cover $1, $&, etc. The mechanism (PL_sawampersand) used
to detect whether $`, $&, $' have been seen in code has been updated to log
each of the three variables separately. The code then uses the index
range of any captures, plus which of $`, $&, $' are present, plus the
presence or not of /p, to decide what part of the string to copy.
In the case of

    $_ = 'x' x 1_000_000;
    1 while /(.)/g;

(with or without $&), the range is a single byte rather than a megabyte
that gets copied a million times, and the code now runs in subsecond time.
This means that the hack can be removed, and printing $1 no longer risks a
segfault.
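The range decision described above can be illustrated with a small Perl model. This is only a sketch and not the C code in the regex engine: the helper name range_to_copy and its flag names (pre, match, post, p) are invented for illustration, and it assumes that /p requires the whole string to be kept.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper, not perl's internal C code. $starts and $ends are
# copies of @- and @+ taken right after a successful match; %seen flags
# which of $` (pre), $& (match), $' (post) or /p (p) are in use.
sub range_to_copy {
    my ($len, $starts, $ends, %seen) = @_;

    # Assumption: /p makes all three parts available, so keep everything.
    return (0, $len) if $seen{p};

    my (@from, @to);

    # Capture groups start at index 1; skip groups that did not take part.
    for my $i (1 .. $#$starts) {
        next unless defined $starts->[$i] && defined $ends->[$i];
        push @from, $starts->[$i];
        push @to,   $ends->[$i];
    }

    # Index 0 describes the whole match, i.e. $&.
    if ($seen{match}) { push @from, $starts->[0]; push @to, $ends->[0] }
    if ($seen{pre})   { push @from, 0;            push @to, $starts->[0] }
    if ($seen{post})  { push @from, $ends->[0];   push @to, $len }

    return (0, 0) unless @from;    # nothing needs saving
    return (min(@from), max(@to));
}

my $str = 'x' x 1_000_000;
$str =~ /(.)/g or die "no match";
my @starts = @-;
my @ends   = @+;

my @plain = range_to_copy(length($str), \@starts, \@ends);
my @post  = range_to_copy(length($str), \@starts, \@ends, post => 1);

printf "captures only: %d byte(s)\n", $plain[1] - $plain[0];
printf "with \$' seen:  %d byte(s)\n", $post[1] - $post[0];
```

In this model the capture-only case needs one byte, while adding $' stretches the range to the end of the string, which matches the observation that $` and $' remain the expensive variables.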

It also means that having just $& in your source code may no longer
necessarily be the huge performance hog it used to be, although
having $` and $' too will drag things down to previous levels.

In summary:

    $_ = 'x' x 1_000_000; 1 while /(.)/g;

before: fast and segfaulty
now:    fast and non-segfaulty

    $&; $_ = 'x' x 1_000_000; 1 while /(.)/g;

before: slow and non-segfaulty
now:    fast and non-segfaulty

This is all working and tested, but hasn't been pushed out for
smoking/merging, since I haven't yet fixed the *original* bug, i.e. the
case where the code block modifies the string being matched:

    my $text = "a"; $text =~ m/(.(?{ $text .= "x" }))*/;

Over the last two months I have averaged 6 hours per week :-(.

As of 2012/08/31: since the beginning of the grant:

 129.7 weeks
1353.2 total hours
  10.4 average hours per week

There are 343 hours left on the grant.

Report for period 2012/07/01 to 2012/08/31 inclusive


    Effort (HH:MM):

        6:25 diagnosing bugs
       46:45 fixing bugs
        0:00 reviewing other people's bug fixes
        0:00 reviewing ticket histories
        0:00 review the ticket queue (triage)
       53:10 TOTAL

    Numbers of tickets closed:

           3 tickets closed that have been worked on
           0 tickets closed related to bugs that have been fixed
           0 tickets closed that were reviewed but not worked on (triage)
           3 TOTAL


45:00 [perl #3634] Capture corruption through self-modying regexp (?{...})
 3:00 [perl #114242] TryCatch toke error with (??{$any})  $ws \\] )? @
 1:00 [perl #114302] Bleadperl v5.17.0-408-g3c13cae breaks DGL/re-engine-RE2-0.10.tar.gz
 2:10 [perl #114356] REGEXPs have massive reference counts
 2:00 [perl #114378] cond_signal does not wake up a thread

Please note that ash-trays are provided for the use of smokers,
whereas the floor is provided for the use of all patrons.
    -- Bill Royston

