
Re: Perl 5.18 and Regexp::Grammars

From: Damian Conway
Date: June 27, 2013 20:47
Subject: Re: Perl 5.18 and Regexp::Grammars
Message ID: CAATtAp6_gYmS3-FMWqKvgkUcmY8HaSHuq76=KAwDyRaJJEVGaQ@mail.gmail.com

Nicholas Clark queried:

> the code should still be able to work if 'use re eval' is added to
> the scopes in which it is used?

Should be able to work. But doesn't.

Here's a more detailed description of the various issues,
excerpted from a private discussion with rjbs:

    This is the current state of play, as I currently understand it
    (bear in mind that I have not had sufficient time to isolate the
    problems cleanly enough to be certain of all issues, and that I have
    no deep understanding of Perl's internals, so my conclusions should
    probably be treated as speculations only):

        * Regexp::Grammars (and other modules) use overload::constant
          'qr' to rewrite augmented regexes into standard Perl syntax,
          to provide new and useful functionality.

        * But overload::constant 'qr' is only applied to the
          compile-time constant portions of a regex.

        * That's fine for "peep-hole" rewriting on regexes
          (e.g. simulating a new \T metacharacter) but no good for
          transformations that apply to the entire regex
          (e.g. maintaining a parallel data-return stack on the entire
          parse). Because, if the regex contains an interpolated variable,
          overload::constant 'qr' doesn't see that piece of the final
          pattern at all, so global transformations of the pattern can't be
          applied to it.

        * To work around this problem, Regexp::Grammars (and other
          modules, notably Regexp::Debugger) use the technique of having
          overload::constant 'qr' *not* return a modified version of the
          original pattern "text".

        * Instead they have the overloading return a blessed object with
          its own '.' overloading. This '.' overloading then also
          returns a blessed object, so that (at runtime) the
          concatenation of interpolated variables in each regex
          eventually produces a single object containing *all* the
          "text" of the pattern, including the interpolated text.

        * This final object also has an overloaded stringification,
          which is automatically called when the regex is JIT-compiled,
          just before matching starts. At that point the stringification
          sub has access to the entire pattern "text", can apply the
          global rewriting transformation to all of it, and having done
          so can return a (now standard-syntax, but very much more
          complex) pattern string, which is finally JIT-compiled into an
          actual regex.

        * In recent versions of Perl, the overloaded stringification
          could be replaced by an overloaded qr-ification, but that sub
          has to return an actual Regexp object. So it has to compile
          the "pattern text" inside the qr-ification subroutine itself,
          which means the pattern is compiled in a different lexical
          scope from where it is declared. Hence any lexical variables
          inside (?{...}) or (??{...}) blocks cannot be correctly closed
          over. That means that a 'qr' overloading is often unacceptable
          if the original regexes--or the rewritten regex--might ever use
          either form of code block. As that's a reasonable expectation
          of *any* regex, this means that the 'qr' overloading is rarely
          useful in practice.

        * So far so good. At least up to Perl 5.16.

        * The first problem that has arisen in 5.18 is that any
          overload::constant 'qr' that uses objects to collect and then
          process the entire pattern "text" (i.e. the technique
          described above) is now potentially broken.

        * Specifically any 'qr' overloading that returns an object that
          stringifies to a pattern "text" that contains (?{...}) or
          (??{...}) will now *sometimes* trigger the dreaded 'use re
          "eval"' warning, even if there is a 'use re "eval"' in the
          scope where the pattern was originally defined.

        * This appears to be because the pattern "text" supplied by the
          object's final stringification is no longer compiled in the original
          regex's scope, so it is not protected even by an explicit 'use
          re "eval"' in that scope.

        * The second problem that has arisen in 5.18 is that variables
          that appear in (?{...}) or (??{...}) blocks are now checked
          for 'use strict' compliance *before* the 'qr' overloading is
          triggered, making it impossible to provide rewritings that
          sanitize such variables.

        * For example, R::G provides pseudo-variables $MATCH and %MATCH
          as an interface to the current parse tree node, rewriting them
          into internal (and 'strict safe') alternatives. In Perl 5.16
          this produces no problems, as the 'qr' overloading is called
          early so the rewritten 'strict safe' alternatives are the ones
          that the compiler actually encounters. In Perl 5.18 the
          compiler apparently encounters the pseudo-variables before
          the 'qr' can write them out of existence. This is not an
          insurmountable problem (I could change the interface...to
          something far less convenient for users), but it's certainly
          backwards incompatible.

        * The third problem that has arisen in 5.18 is when the module
          injects a code block that accesses an in-scope lexical
          variable. Those blocks, when compiled, appear to
          *sometimes* be failing to close over the correct variable.

        * For example, the R::G <%hash> construct is rewritten into a
          block like so:

                (??{
                        exists $hash{$^N} ? q{} : q{(?!)}
                })

          But, when matching, the lexical variable %hash appears to be
          empty inside the code block, even though it is definitely not
          empty in the enclosing lexical scope.

        * The final problem that has arisen in 5.18 is that several
          tests in R::G's suite changed from passing under 5.16 to
          segfaulting under 5.18. That's a separate, and arguably far
          more serious, problem in itself...and an indication that some
          deep issue still lurks in the new mechanism. I have not yet
          had the time to track this problem down more specifically.
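
To illustrate the point above that overload::constant 'qr' only ever sees
the compile-time constant portions of a regex, here is a minimal sketch.
It is not taken from Regexp::Grammars; the names @seen_pieces, $var and $re
are invented for the example, and the handler is installed directly in a
BEGIN block rather than from a module's import() as a real module would do.

```perl
use strict;
use warnings;
use overload ();

our @seen_pieces;    # every constant chunk handed to the handler (hypothetical name)
our $re;

{
    # Install the handler for this block only: the setting is lexically scoped.
    BEGIN {
        overload::constant qr => sub {
            my ($raw, $cooked, $mode) = @_;
            push @seen_pieces, $raw;
            return $cooked;    # leave the pattern text unchanged
        };
    }

    my $var = 'b+';            # interpolated at run time
    $re = qr/a${var}c/;
}

# The handler is only offered the literal chunks around the interpolation.
print "handler saw: '$_'\n" for @seen_pieces;
print "pattern still matches 'abbbc'\n" if 'abbbc' =~ $re;
```

The contents of $var never reach the handler, which is why a rewrite that
needs the whole pattern cannot be done at this stage.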

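The object-based workaround described above (constant handler returns an
object, '.' accumulates the text, stringification rewrites the whole
pattern) can be sketched roughly as follows. This is an illustration under
stated assumptions, not the actual Regexp::Grammars or Regexp::Debugger
code: the WholePattern package, its helper subs, and the "global rewrite"
(anchoring the complete pattern) are all invented for the example, and the
behaviour may well differ between Perl versions, as the rest of this
message makes clear.

```perl
use strict;
use warnings;
use overload ();

package WholePattern {    # hypothetical class, for illustration only
    use Scalar::Util 'blessed';
    use overload
        q{.}  => \&_concat,
        q{""} => \&_stringify;

    sub new {
        my ($class, $text) = @_;
        return bless { text => $text }, $class;
    }

    # Plain text of either another WholePattern or an ordinary value.
    sub _text_of {
        my ($thing) = @_;
        return blessed($thing) && $thing->isa(__PACKAGE__)
            ? $thing->{text}
            : "$thing";
    }

    # Concatenation yields another object, so the text keeps accumulating.
    sub _concat {
        my ($self, $other, $swapped) = @_;
        my ($left, $right) = ($self->{text}, _text_of($other));
        ($left, $right) = ($right, $left) if $swapped;
        return WholePattern->new($left . $right);
    }

    # Called when the regex is finally compiled: the entire text is now
    # available, so a whole-pattern rewrite can be applied (here: anchoring).
    sub _stringify {
        my ($self) = @_;
        return '\A(?:' . $self->{text} . ')\z';
    }
}

our $re;
{
    BEGIN {
        overload::constant qr => sub {
            my ($raw) = @_;
            return WholePattern->new($raw);    # an object, not modified text
        };
    }

    my $var = 'b+';
    $re = qr/a${var}c/;    # pieces: object('a'), 'b+', object('c')
}

print "'abbbc' matches\n"          if 'abbbc'  =~ $re;
print "'xabbbc' does not match\n"  if 'xabbbc' !~ $re;
```

Anchoring was chosen as the rewrite only because it is a transformation that
cannot be applied piece by piece; it needs the complete pattern text.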

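For reference, this is how a block of the <%hash> form quoted above is
meant to behave when it is written by hand as a literal in the source and
closes over a lexical in the ordinary way. It is only a sketch: the data in
%hash, the surrounding pattern and the is_known() helper are invented for
the example, and it does not go through any 'qr' overloading, so it does not
reproduce the 5.18 failure itself.

```perl
use strict;
use warnings;

my %hash = (cat => 1, dog => 1);    # hypothetical lookup table

# After capturing a word, the postponed block yields an empty pattern
# (always succeeds) if the word is a key of %hash, or (?!) (always fails).
my $known_word = qr/
    \A (\w+)
    (??{ exists $hash{$^N} ? q{} : q{(?!)} })
    \z
/x;

sub is_known {
    my ($word) = @_;
    return $word =~ $known_word ? 1 : 0;
}

printf "%-8s %s\n", $_, (is_known($_) ? 'known' : 'unknown')
    for qw(cat dog cow catfish);
```

The third problem above is that the injected equivalent of this block
appears to see an empty %hash at match time.
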
BTW, I think it is very likely that Regexp::Debugger (which,
necessarily, uses exactly the same 'qr' overloading pattern) may be
susceptible to similar problems. In my view, that's a much more serious
issue: R::G is arguably a niche product, but R::D should be in every
Perl developer's toolbox.


Finally, please note that I am currently mired in final preparations for
my annual speaking tour, which starts next week. As a result, my response
time will be poor, and I will not be able to properly perlbug this issue
(i.e. boil Regexp::Grammars' 2500 lines of code down to minimal examples)
for at least a month.

I appreciate everyone's concern over this issue, and apologize for the trouble
it's causing.

Damian
