perl.perl5.porters | Postings from July 2013

Re: refactoring of regex execution / calling

From: demerphq
Date: July 31, 2013 00:37
Subject: Re: refactoring of regex execution / calling
Message ID: CANgJU+ULYVhq84S26nRb39eu8hgKSrxfXqCRkveKYmFJfWdsMg@mail.gmail.com

On 30 July 2013 22:20, Dave Mitchell <davem@iabyn.com> wrote:
> I pushed this merge commit a couple of days ago. It's fairly
> self-explanatory. It was originally an attempt to fix intuit-only matches
> under COW, and grew into a 50 commit monster.
>
> commit e82485c19c70d922047c43d035a5e59a7c08ce67
> Merge: 8088f39 2bfbe30
> Author:     David Mitchell <davem@iabyn.com>
> AuthorDate: Sun Jul 28 14:09:44 2013 +0100
> Commit:     David Mitchell <davem@iabyn.com>
> CommitDate: Sun Jul 28 14:09:44 2013 +0100
>
>     [MERGE] refactor pp_match(), pp_subst(), regexec()
>
>     Notionally the regexec engine has a well-defined API.
>     In practice, the caller of regexec() (typically pp_match() or pp_subst()),
>     is required to do a lot of set-up before calling regexec(), and some
>     post-processing afterwards; in particular to handle \G, to handle intuit,
>     and to set up $& correctly after an intuit-only match.
>
>     The series of commits in this branch refactors the code around these three
>     functions so that all the regex "knowledge" is now contained within
>     regexec() rather than in the calling pp functions. At the same time, the
>     pp functions have been heavily cleaned up and simplified where possible.
>     This reduces the LOC in pp_match() from 305 to 186.
>
>     The most visible refactorisation changes are that:
>
>     * the call to intuit is now done from regexec() rather than from pp*;
>
>     * ditto the setting of $& on intuit-only matches;
>
>     * all the extra setup for \G is now in a single block of code in regexec(),
>       rather than being distributed haphazardly across all 3 functions;
>
>     Along the way various things have been improved and bugs have been fixed:
>
>     * intuit-only matches had been inadvertently disabled when COW was enabled;
>       this is now fixed. (An intuit-only match is where intuit finding a suitable
>       start position is sufficient to determine that the pattern has matched,
>       e.g. a fixed string pattern /abc/ without captures);
>
>     * intuit-only substitutions had never been enabled; they are now;
>       e.g. s/foo/bar/g;
>
>     * formerly, intuit was skipped in the presence of anchored \G; this is no
>       longer the case, so that something like "aaaa" =~ /\G.*xx/ now fails
>       quickly due to the missing "xx";
>
>     * the COW code will try to reuse the COW copy SV on subsequent captures on
>       the same regex and string, rather than freeing and reallocating.
>
>     * substitutions will no longer permit themselves to iterate "backwards",
>       e.g. with s/.(?=.\G)/x/g;
>
>     * some obscure utf8 issues with s/// have been fixed;
>
>     * some bugs with \G fixed (and probably new ones added)
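
A few of the behaviours listed above can be observed from plain Perl.
The following is only a sketch: it checks the user-visible results ($&,
the substitution count, pos()) and cannot tell whether the engine
actually took the intuit-only path, which is an internal detail. The
variable names are made up for illustration.

```perl
use strict;
use warnings;

# 1. Fixed-string pattern without captures: the kind of pattern the
#    commit message gives as a candidate for an intuit-only match.
#    Whichever path the engine takes, $&, @- and @+ must be set.
my $str = "xxabcxx";
my ($matched, $start, $end) = ('', -1, -1);
if ($str =~ /abc/) {
    ($matched, $start, $end) = ($&, $-[0], $+[0]);
}
print "matched '$matched' at offsets $start..$end\n";

# 2. Fixed-string substitution with /g, as in the s/foo/bar/g example.
#    s///g returns the number of replacements made.
my $s = "foo foo foo";
my $n = ($s =~ s/foo/bar/g);
print "$n replacements: $s\n";

# 3. Anchored \G with a required substring that is absent.
#    The match has to fail because the string contains no "xx".
my $g_matched = ("aaaa" =~ /\G.*xx/) ? 1 : 0;
print "\\G.*xx matched: $g_matched\n";

# 4. \G anchors at pos() when one has been set on the target string.
my $t = "aaaa";
pos($t) = 2;
my $ok = ($t =~ /\Ga/g) ? 1 : 0;
my $pos_after = pos($t);
print "matched at pos 2: $ok, pos now $pos_after\n";
```

To see what the engine does internally for any of these patterns, the
-Mre=debug switch used in the signature below is one place to start.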

Haven't looked at the patch yet, but this mail fills me with warmth and joy.

Thanks Dave.

Yves


-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
