perl.perl5.porters | Postings from July 2013

refactoring of regex execution / calling

Dave Mitchell
July 30, 2013 20:21
I pushed this merge commit a couple of days ago. It's fairly
self-explanatory. It was originally an attempt to fix intuit-only matches
under COW, and grew into a 50 commit monster.

commit e82485c19c70d922047c43d035a5e59a7c08ce67
Merge: 8088f39 2bfbe30
Author:     David Mitchell <>
AuthorDate: Sun Jul 28 14:09:44 2013 +0100
Commit:     David Mitchell <>
CommitDate: Sun Jul 28 14:09:44 2013 +0100

    [MERGE] refactor pp_match(), pp_subst(), regexec()
    Notionally the regexec engine has a well-defined API.
    In practice, the caller of regexec() (typically pp_match() or pp_subst()),
    is required to do a lot of set-up before calling regexec(), and some
    post-processing afterwards; in particular to handle \G, to handle intuit,
    and to set up $& correctly after an intuit-only match.
    The series of commits in this branch refactors the code around these three
    functions so that all the regex "knowledge" is now contained within
    regexec() rather than in the calling pp functions. At the same time, the
    pp functions have been heavily cleaned up and simplified where possible.
    This reduces the LOC in pp_match() from 305 to 186.
    The most visible refactorisation changes are that:
    * the call to intuit is now done from regexec() rather than from pp*;
    * ditto the setting of $& on intuit-only matches;
    * all the extra setup for \G is now in a single block of code in regexec(),
      rather than being distributed haphazardly across all 3 functions;
    Along the way various things have been improved and bugs have been fixed:
    * intuit-only matches had been inadvertently disabled when COW was enabled;
      this is now fixed. (An intuit-only match is where intuit finding a suitable
      start position is sufficient to determine that the pattern has matched,
      e.g. a fixed string pattern /abc/ without captures);
    * intuit-only substitutions had never been enabled; they are now,
      e.g. s/foo/bar/g;
    * formerly, intuit was skipped in the presence of anchored \G; this is no
      longer the case, so that something like "aaaa" =~ /\G.*xx/ now fails
      quickly due to the missing "xx";
    * the COW code will try to reuse the COW copy SV on subsequent captures on
      the same regex and string, rather than freeing and reallocating.
    * substitutions will no longer permit themselves to iterate "backwards",
      e.g. with s/.(?=.\G)/x/g;
    * some obscure utf8 issues with s/// have been fixed;
    * some bugs with \G have been fixed (and probably new ones added).
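
For readers less familiar with the constructs named in the list above, here is a minimal Perl sketch of their user-visible behaviour. It is an illustration only, not part of the commit: the variable names and sample strings are made up, and nothing in it can tell whether the engine actually took the intuit-only path, since that is an internal optimisation. (Running a script with `use re 'debug';` prints the engine's own trace if you want to look at that.)

```perl
use strict;
use warnings;

# A fixed-string pattern without captures: the kind of pattern the commit
# message gives as an example of an intuit-only match. Whichever path the
# engine takes internally, $& and @- must still be set correctly.
my $str     = "xxabcxx";
my $matched = $str =~ /abc/;
my ($offset, $whole) = ($-[0], $&);

# The fixed-string substitution case, s/foo/bar/g.
(my $copy = "foo foo baz") =~ s/foo/bar/g;

# Anchored \G: the match has to start at pos(). There is no "xx" anywhere
# in the string, so the match fails whatever the start position is.
my $s = "aaaa";
pos($s) = 1;
my $g_ok = ($s =~ /\G.*xx/) ? 1 : 0;

# \G in a simple scanner: each match starts where the previous one ended.
my $input = "ab12cd";
my @tokens;
while ($input =~ /\G([a-z]+|[0-9]+)/gc) {
    push @tokens, $1;
}

print "matched '$whole' at offset $offset\n";
print "substituted: $copy\n";
print "anchored \\G match succeeded: $g_ok\n";
print "tokens: @tokens\n";
```

The example deliberately leaves out the "backwards" substitution s/.(?=.\G)/x/g, because its result depends on whether the perl in use includes this merge.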

Indomitable in retreat, invincible in advance, insufferable in victory
    -- Churchill on Montgomery
