develooper Front page | perl.perl5.porters | Postings from October 2012

DAVEM TPF Grant September 2012 report

Thread Next
Dave Mitchell
October 1, 2012 09:32
DAVEM TPF Grant September 2012 report
Message ID:
As per my grant conditions, here is a report for the September period.

This month I mostly spent my time:

1) Making regexes only copy the needed part of a string.

In general when a regex is executed, and it is seen that captures exist
(either explicitly with /(...)/ or implicitly in the presence of $`, $& or
$'), then perl makes a copy of the string that was matched against, so
that subsequent changes to the string won't affect $1 etc.

That makes things like the following work:

    $_ = "abc";
    $_ = 123";
    ok($_ eq "a");

However in the following code, the copying on each match causes the match
to go quadratic, copying a 1Mb string a million times:

    $_ = 'x' x 1_000_000;
    while (/(.)/) { ... }

So in this case perl has always "optimised" that particular construct
(//g in scalar context) to skip the string copying. This makes it fast but
with potentially incorrect $1 etc values.

In the presence of $& etc, this "optimisation" is skipped, making the
code slow but correct.

My work did two things. First, it split the detection of $`, $& and $&
into three separate flags, so having just one of the three in your code
won't necessarily do as much damage as before; secondly, when the string
buffer is copied, only a single substring of it is copied: enough to
cover the union of $1, $2, ... and whichever of $`, $& and $' have been
seen. With this in place, this code

    $_ = 'x' x 1_000_000;
    while (/(.)/) { ... }

(with or without $& present) will just copy a single character each time;
so it's now both fast and correct. If $` say were present too, then it
would be a lot slower, copying an incrementally larger string each time.

2) Making the regex engine handle non-null-terminated strings.

Traditionally, strings created from within perl have a hidden null byte
at the end to make it easier to pass them to routines that expect C-style
null-terminated strings. However, strings from other sources (such as XS)
may not. It turns out that from its beginnings in perl 3, the engine has
always worked on the assumption that any string will have an extra zero
byte at the end, and relies on this for a number of things, from simply
stopping at the right point, to various match types such as \b only
working correctly at string ends due to this hidden byte.

I've overhauled the engine so that as far as I'm aware, it no longer
accesses any byte after the end of the string for any reason. It turns out
that the code which failed this assumption was fairly pervasive, and
there's no guarantee that I got everything. I do know however that
none of the tests under t/re/ trigger an overrun, since I ran valgrind on
all those tests with a modified engine that realloced every string it was
about to match against.

Note that both (1) and (2) were necessary prerequisites to fixing the
self-modifying string bug:

    $s =~ /.({ $s .= 'x' })/

However, I've put off working on a final fix for that for while, as I've
got a bit sick and tired of working the regex engine, and turned my
attention to a bit of optimising:

3) Turn multiple PAD ops into a single PADRANGE op

Every declaration or use of a 'my' lexical variable involves the execution
of a PADSV, PADAV or PADHV op. Since these often come in groups with
targets in a consecutive range, it makes sense under some circumstances to
collect them all into a single op (which will also carry out some
ancilliary tasks). For example, the simple declaration:

    my ($a,@b,%c);

is currently compiled into 5 ops:


With my current work, this now compiles into a single op:


As well as being more efficient in general by avoiding the overhead of
executing several lightweight ops, in this particular case (my declaration
in void context), it skips altogether pushing those three vars onto the
stack then popping them again, which is what used to pointlessly happen.

Although not done yet, there is scope for further optimisations; for
example the 'my' declaration above causes a SAVECLEARSV to be pushed onto
the save stack by each of the PAD ops. My padrange code still does 3
separate SAVECLEARSV pushes in a loop, but that could be replaced with a
single SAVECLEARSV_RANGE save type, saving time on save and restore.


Over the last month I have averaged 22 hours per week
As of 2012/09/30: since the beginning of the grant:

 134.0 weeks
1442.2 total hours
  10.8 average hours per week

There are 258 hours left on the grant.

Report for period 2012/09/01 to 2012/09/30 inclusive


    Effort (HH::MM):

        0:00 diagnosing bugs
       89:00 fixing bugs
        0:00 reviewing other people's bug fixes
        0:00 reviewing ticket histories
        0:00 review the ticket queue (triage)
       89:00 TOTAL

    Numbers of tickets closed:

           0 tickets closed that have been worked on
           0 tickets closed related to bugs that have been fixed
           0 tickets closed that were reviewed but not worked on (triage)
           0 TOTAL


57:40 [perl #3634] Capture corruption through self-modying regexp (?{...})
31:20 [perl #114536] Optimize assigning to scalars from @_

"I do not resent criticism, even when, for the sake of emphasis,
it parts for the time with reality".
    -- Winston Churchill, House of Commons, 22nd Jan 1941.

Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About