develooper Front page | perl.perl5.porters | Postings from October 2018

regex compilation sizing pass removed in 5.29.4

Karl Williamson
October 20, 2018 16:28
regex compilation sizing pass removed in 5.29.4
Message ID:
commit 7c932d07cab18751bfc7515b4320436273a459e2
Author: Karl Williamson <>
Date:   Fri Oct 19 09:48:34 2018 -0600

     Remove sizing pass from regular expression compiler

This means that in many instances, patterns will be compiled in a single 
pass.  This change majorly affects how almost all perl programs compile. 
  It is supposed to be transparent, compiling to the same engine as 
before, but, of course, there may be bugs or unintended consequences, 
and it may unleash a slew of BBC reports, which can be viewed as 
opportunities to improve our test suite coverage. ;)

It has long been a goal of Yves and me and others to remove the sizing 
pass.  Though I think the real goal is to turn the pattern generation 
into an AST type thing, which this commit decidedly does not do.  It is, 
however, a step in that direction.

The full commit message is below, but I want to stress that I did this 
now because I came to the realization that having a sizing pass that has 
bugs that subtly compute too small a size than is necessary turns garden 
variety bugs into CVEs.

Here is the commit message:

This commit removes the sizing pass for regular expression compilation.
It attempts to be the minimum required to do this.  Future patches are
in the works that improve it,, and there is certainly lots more that
could be done.

This is being done now for security reasons, as there have been several
bugs leading to CVEs where the sizing pass computed the size improperly,
and a crafted pattern could allow an attack.  This means that simple
bugs can too easily become attack vectors.

This is NOT the AST that people would like, but it should make it easier
to move the code in that direction.

Instead of a sizing pass, as the pattern is parsed, new space is
malloc'd for each regnode found.  To minimize the number of such mallocs
that actually go out and request memory, an initial guess is made, based
on the length of the pattern being compiled.  The guessed amount is
malloc'd and then immediately released.  Hopefully that memory won't be
gobbled up by another process before we actually gain control of it.
The guess is currently simply the number of bytes in the pattern.
Patches and/or suggestions are welcome on improving the guess or this

This commit does not mean, however, that only one pass is done in all
cases.  Currently there are several situations that require extra
passes.  These are:

     a)  If the pattern isn't UTF-8, but encounters a construct that
         requires it to become UTF-8, the parse is immediately stopped,
         the translation is done, and the parse restarted.  This is
         unchanged from how it worked prior to this commit.
     b)  If the pattern uses /d rules and it encounters a construct that
         requires it to become /u, the parse is immediately stopped and
         restarted using /u rules.  A future enhancement is to only
         restart if something has been encountered that would generate
         something different than what has already been generated, as
         many operations are the same under both /d and /u.  Prior to
         this commit, in rare circumstances was the parse immediately
         restarted.  Only those few that changed the sizing did so.
         Instead the sizing pass was allowed to complete and then the
         generation pass ran, using /u.  Some CVEs were caused by faulty
         implementation here.
     c)  Very large patterns may need to have long jumps in their
         program.  Prior to this commit, that was determined in the
         sizing pass, and all jumps were made long during generation.
         Now, the first time the need for a long jump is detected, the
         parse is immediately restarted, and all jumps are made long.  I
         haven't investigated enough to be sure, but it might be
         sufficient to just redo the current jump, making it long, and
         then switch to using just long jumps, without having to restart
         the parse from the beginning.
     d)  If a reference that could be to capturing parentheses doesn't
         find such parentheses, a flag is set.  For references that could
         be octal constants, they are assumed to be those constants
         instead of a capturing group.  At the end of the parse, if the
         flag indicates either that the assumption(s) were wrong or that
         it is a fatal reference to a non-existent group, the pattern is
         reparsed knowing the total number of these groups.
     e)  If (?R) or (?0) are encountered, the flag listed in item d)
         above is set to force a reparse.  I did have code in place that
         avoided doing the reparse, but I wasn't confident enough that
         our test suite exercises that area of the code enough to have
         exposed all the potential interaction bugs, and I think this
         construct is used rarely enough to not worry about avoiding the
         reparse at this point in the development.
     f)  If (?|pattern) is encountered, the behavior is the same as in
         item e) above.  The pattern will end up being reparsed after the
         total number of parenthesized groups are known.  I decided not
         to invest the effort at this time in trying to get this to work
         without a reparse.

It might be that if we are continuing the parse to just count
parentheses, and we encounter a construct that normally would restart
the parse immediately, that we could defer that restart.  This would cut
down the maximum number of parses required.  As of this commit, the
worst case is we find something that requires knowing all the
parentheses; later we have to switch to /u rules and so the parse is
restarted.  Still later we have to switch to long jumps, and the parse
is restarted again.  Still later we have to upgrade to UTF-8, and the
parse is restarted yet again.  Then the parse is completed, and the
final total of parentheses is known, so everything is redone a final
time.  Deferring those intermediate restarts would save a bunch of

Prior to this commit, warnings were issued only during the code
generation pass, which didn't get called unless the sizing pass(es)
completed successfully.  But now, we don't know if the pass will
succeed, fail, or whether it will have to be restarted.  To avoid
outputting the same warning more than once, the position in the parse of
the last warning generated is kept (across parses).  The code looks at
that position when it is about to generate a warning.  If the parsing
has previously gotten that far, it assumes that the warning has already
been generated, and suppresses it this time.  The current state of
parsing is such that I believe this assumption is valid.  If the parses
had divergent paths, that assumption could become invalid. Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About