develooper Front page | perl.perl5.porters | Postings from June 2013

DAVEM TPF Grant May 2013 report

Thread Next
Dave Mitchell
June 3, 2013 15:06
DAVEM TPF Grant May 2013 report
Message ID:
This month was mostly spent on removing global state from the regex
engine, making re-entrantcy less error-prone. The extract from the merge
commit description below gives you all the details you could ever want.

Apart from that I spent a few hours re-enabling Copy-on_Write by default
post the 5.18.0 release, plus a few other bits and pieces.

It turns out that I have finally used up all the hours on my grant plus
extensions. I really must get round to applying for a new grant sometime

commit 7d75537ea64f99b6b8b8049465b6254f5d16c693
Merge: 3a74e0e 28d03b2
Author:     David Mitchell <>
AuthorDate: Sun Jun 2 20:59:58 2013 +0100

[MERGE] get rid of (most) regex engine global state

Historically, perl's regex engine was based on Henry Spencer's regex code,
which was all contained within a single file and used a bunch of static
variables to maintain the state of the current regex compile or execution.

This was perfectly adequate when only a single thread could execute a
regex, and where the regex engine couldn't be called re-entrantly.

In 5.0, these vars were promoted to be full global vars as perl became
embeddable; then in 5.5 they became part of the perl interpreter struct
when MULTIPLICITY was introduced.

In 5.6, the Perl_save_re_context() function was introduced that did a
whole bunch of SAVEPPTR(PL_bostr) type stuff, and was called in various
places where it was possible that the engine may be re-entered, to avoid
overwriting the global state of the currently executing regex. This was
particularly important now that Unicode had been introduced, and certain
character classes could trigger a call to the perl-level SWASH code, which
could itself execute a regex; and where /(?{ ... })/ code blocks could be
called which could do likewise.

In 5.10, The various PL_foo variables became fields within the new
re_save_state struct, and a new interpreter var, PL_reg_state, was
introduced which was of type struct re_save_state. Thus, all the
individual vars were still global state, but it became easier to save them
en-mass in Perl_save_re_context() by just copying the re_save_state stuct
onto the save stack and marking it with the new SAVEt_RE_STATE type.
Perl_save_re_context() was also expanded to be responsible for saving all
the current $1 values.

Up until now, that is roughly how things have remained, except for bug
fixes such as discovering more places where Perl_save_re_context() needs
to be called.

Note that, philosophically speaking at least, this is broken in two ways.
First, there's no good reason for the internal current state of the
executing regex engine to be stored in a bunch of global vars; and
secondly we're relying on potential callers of the regex engine (like the
magic tie code for example), to be responsible for being aware that they
*might* trigger re-entrancy in the regex engine, and to thus do
Perl_save_re_context() as a precaution. This is error-prone and hard to
prove correct.  (As an example, Perl_save_re_context() is only called in
the tie code if the tie code in question is doing a tied PRINT on STDERR;
clearly an unusual use case that someone spotted was buggy at some point).

The obvious fix, and the one performed by the series of commits in this
merge, is to make all the global state local to the regex engine instead.
Indeed, there is already a struct, regmatch_info, that is allocated as a
local var in regexec(), then passed as an argument to the various
lower-level functions called from regexec(). However, it only had limited
use previously, so here we expand the number of functions where it is
passed as an argument. In particular, it is now also created by
re_intuit_start(), the other main run-time entry point to the regex

However, there is a problem with this, in that various regex vars need
cleaning up on croak (e.g. they point to a malloced buffer). Since the
regmatch_info struct is just a local var on the C stack, it will be lost
by the longjmp done by a croak() before leave_scope() can clear up.

To handle this, some fields that logically should go in regmatch_info,
are instead added to two new structs: regmatch_info_aux and
regmatch_info_aux_eval; the first represents all the normal fields that
need some sort of cleanup handling; the second represents extra fields
(also possibly needing cleanup) that are only only needed if the pattern
contains /(?{})/ code blocks. These two structs are allocated in the next
two free PL_regmatch_state stack slots - since these slots are allocated
in 4K slabs anyway, they are essentially free of charge. A single
destructor function, S_cleanup_regmatch_info_aux() is then used with
SAVEDESTRUCTOR_X() to perform all cleanup at the end of execution.

In addition, all state and cleanup setup has been consolidated into a
single point near the start of regexec(); previously it was spread across
regexec(), regtry() and regmatch(). This used also to result in various
inefficencies, such as PL_regmatch_state stack freeing all higher unused
slabs at the end of each call to regmatch(), which might be called
multiple times by regexec(). Now it just frees once.

As part of this series of fixes it was necessary to change the API of
Perl_re_intuit_start(). This is because the API was broken: unlike
Perl_regexec_flags(), it didn't have a strbeg arg, and would try to guess
it from the SV (if any) passed to it. This could fail on overloaded SVs
for example, or where its called without an SV (not done from core, but
officially supported by the API). Note that this is likely to break
Re::Engine plugins, plus any code which directly calls intuit.

Finally, note that although struct re_save_state and SAVEt_RE_STATE are
gone, Perl_save_re_context() still does something useful: the equivalent
of local($1,$2...). Fixing that properly is a whole separate kettle of
fish, not addressed here.

As far as I'm aware, the only remaining global vars associated with the
regex engine are

    PL_reg_curpm, PL_regmatch_state, PL_regmatch_slab, PL_colors, PL_colorse

None of these are effected by re-entrancy. The state stack is, erm, a stack,
so it can handle re-entrancy quite happily, and the others are safe too.


Over the last month I have averaged 9.3 hours per week

As of 2013/05/31: since the beginning of the grant:

 168.7 weeks
1700.0 total hours
  10.1 average hours per week

There are 0 hours left on the grant.

Report for period 2013/05/01 to 2013/05/31 inclusive


    Effort (HH::MM):

        3:30 diagnosing bugs
       39:51 fixing bugs
        0:00 reviewing other people's bug fixes
        0:00 reviewing ticket histories
        0:00 review the ticket queue (triage)
       43:21 TOTAL

    Numbers of tickets closed:

           1 tickets closed that have been worked on
           0 tickets closed related to bugs that have been fixed
           0 tickets closed that were reviewed but not worked on (triage)
           1 TOTAL


33:29 [perl #114878] Regular Expression matching in signal handler
 2:00 [perl #116407] SvPVutf8 != SvPVX, and sv_2pvutf8
 4:33 [perl #116569] Re: 5.17.7 breaks rules of assignment
 1:49 [perl #117135] v5.17.9-80-g9f351b4 breaks SARTAK/Path-Dispatcher-1.04.tar.gz
 1:00 [perl #117917] Bug in Regex-Engine (?)
 0:30 [perl #118213] /$qr/p broken under 5.18.0

In economics, the exam questions are the same every year.
They just change the answers.

Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About