Front page | perl.perl5.porters |
Postings from June 2013
Re: Perl 5.18 and Regexp::Grammars
From: Damian Conway
Date: June 27, 2013 20:47
Subject: Re: Perl 5.18 and Regexp::Grammars
Message ID: CAATtAp6_gYmS3-FMWqKvgkUcmY8HaSHuq76=KAwDyRaJJEVGaQ@mail.gmail.com
Nicholas Clark queried:
> the code should still be able to work if 'use re eval' is added to
> the scopes in which it is used?
Should be able to work. But doesn't.
Here's a more detailed description of the various issues,
excerpted from a private discussion with rjbs:
This is the state of play, as I currently understand it
(bear in mind that I have not had sufficient time to isolate the
problems cleanly enough to be certain of all issues, and that I have
no deep understanding of Perl's internals, so my conclusions should
probably be treated as speculations only):
* Regexp::Grammars (and other modules) use overload::constant
'qr' to rewrite augmented regexes into standard Perl syntax,
to provide new and useful functionality.
* But overload::constant 'qr' is only applied to the
compile-time constant portions of a regex.
* That's fine for "peep-hole" rewriting on regexes
(e.g. simulating a new \T metacharacter) but no good for
transformations that apply to the entire regex
(e.g. maintaining a parallel data-return stack on the entire
parse). Because, if the regex contains an interpolated variable,
overload::constant 'qr' doesn't see that piece of the final
pattern at all, so global transformations of the pattern can't be
applied to it.
* To work around this problem, Regexp::Grammars (and other
modules, notably Regexp::Debugger) use the technique of having
overload::constant 'qr' *not* return a modified version of the
original pattern "text".
* Instead they have the overloading return a blessed object with
its own '.' overloading. This '.' overloading then also
returns a blessed object, so that (at runtime) the
concatenation of interpolated variables in each regex
eventually produces a single object containing *all* the
"text" of the pattern, including the interpolated text.
* This final object also has an overloaded stringification,
which is automatically called when the regex is JIT-compiled,
just before matching starts. At that point the stringification
sub has access to the entire pattern "text", can apply the
global rewriting transformation to all of it, and having done
so can return a (now standard-syntax, but very much more
complex) pattern string, which is finally JIT-compiled into an
actual regex.
* In recent versions of Perl, the overloaded stringification
could be replaced by an overloaded qr-ification, but that sub
has to return an actual Regexp object. So it has to compile
the "pattern text" inside the qr-ification subroutine itself,
which means the pattern is compiled in a different lexical
scope from where it is declared. Hence any lexical variables
inside (?{...}) or (??{...}) blocks cannot be correctly closed
over. That means that a 'qr' overloading is often unacceptable
if the original regexes--or the rewritten regex--might ever use
either form of code block. As that's a reasonable expectation
of *any* regex, this means that the 'qr' overloading is rarely
useful in practice.
* So far so good. At least up to Perl 5.16.
* The first problem that has arisen in 5.18 is that any
overload::constant 'qr' that uses objects to collect and then
process the entire pattern "text" (i.e. the technique
described above) is now potentially broken.
* Specifically any 'qr' overloading that returns an object that
stringifies to a pattern "text" that contains (?{...}) or
(??{...}) will now *sometimes* trigger the dreaded 'use re
"eval"' warning, even if there is a 'use re "eval"' in the
scope where the pattern was originally defined.
* This appears to be because the pattern "text" supplied by the
object's final stringification is no longer compiled in the original
regex's scope, so it is not protected even by an explicit 'use
re "eval"' in that scope.
* The second problem that has arisen in 5.18 is that variables
that appear in (?{...}) or (??{...}) blocks are now checked
for 'use strict' compliance *before* the 'qr' overloading is
triggered, making it impossible to provide rewritings that
sanitize such variables.
* For example, R::G provides pseudo-variables $MATCH and %MATCH
as an interface to the current parse tree node, rewriting them
into internal (and 'strict safe') alternatives. In Perl 5.16
this produces no problems, as the 'qr' overloading is called
early so the rewritten 'strict safe' alternatives are the ones
that the compiler actually encounters. In Perl 5.18 the
compiler apparently encounters the pseudo-variables before
the 'qr' overloading can write them out of existence. This is not an
insurmountable problem (I could change the interface...to
something far less convenient for users), but it's certainly
backwards incompatible.
* The third problem that has arisen in 5.18 is when the module
injects a code block that accesses an in-scope lexical
variable. Those blocks, when compiled, appear to
*sometimes* be failing to close over the correct variable.
* For example, the R::G <%hash> construct is rewritten into a
block like so:
    (??{
        exists $hash{$^N} ? q{} : q{(?!)}
    })
But, when matching, the lexical variable %hash appears to be
empty inside the code block, even though it is definitely not
empty in the enclosing lexical scope.
* The final problem that has arisen in 5.18 is that several
tests in R::G's suite changed from passing under 5.16 to
segfaulting under 5.18. That's a separate, and arguably far
more serious, problem in itself...and an indication that some
deep issue still lurks in the new mechanism. I have not yet
had the time to track this problem down more specifically.
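
For anyone who has not met the object-collection technique described in
the bullets above, the following is a minimal sketch of it. It is not
Regexp::Grammars' actual code: the PatternText class, its concat and
render methods, the matches_unit helper, the @seen and @rendered arrays,
and the "anchor both ends" rewrite are all hypothetical, invented purely
for illustration. How it behaves on any particular perl (5.18.0 included)
is not something this sketch claims.

```perl
use strict;
use warnings;
use overload ();    # loaded only so overload::constant can be called below

# Hypothetical class: collects pattern pieces until the regex is compiled.
package PatternText {
    use Scalar::Util 'blessed';
    use overload
        '.'  => \&concat,    # keep collecting across interpolations
        '""' => \&render;    # final rewrite when the regex is compiled

    our @rendered;           # complete pattern texts seen by render()

    sub new {
        my ($class, @pieces) = @_;
        return bless { pieces => [@pieces] }, $class;
    }

    # '.' overloading: return another object holding both operands' text.
    sub concat {
        my ($self, $other, $swapped) = @_;
        my @mine   = @{ $self->{pieces} };
        my @theirs = blessed($other) && $other->isa(__PACKAGE__)
                   ? @{ $other->{pieces} }
                   : ( defined $other ? "$other" : '' );
        return __PACKAGE__->new( $swapped ? (@theirs, @mine) : (@mine, @theirs) );
    }

    # Stringification: the whole pattern text is available here, so a
    # whole-pattern transformation is possible (here: anchor both ends).
    sub render {
        my ($self) = @_;
        my $text = join '', @{ $self->{pieces} };
        push @rendered, $text;
        return '\A(?:' . $text . ')\z';
    }
}

our @seen;    # constant pieces handed to the 'qr' handler at compile time

sub matches_unit {
    my ($input, $unit) = @_;

    # Lexically scoped: only regexes in the rest of this sub are affected.
    BEGIN {
        overload::constant( qr => sub {
            my ($source_text) = @_;
            push @seen, $source_text;
            return PatternText->new($source_text);
        } );
    }

    return $input =~ /[0-9]+(?:$unit)/ ? 1 : 0;
}

print "constant pieces seen by handler: @seen\n";
print "'12kg'  matches: ", matches_unit('12kg',  'kg|lb'), "\n";
print "'x12kg' matches: ", matches_unit('x12kg', 'kg|lb'), "\n";
print "text seen at stringification: $PatternText::rendered[-1]\n";
```

The intent is that the 'qr' handler is only ever handed the constant
pieces on either side of $unit, whereas render() also gets the
interpolated text, which is what lets it rewrite the pattern as a whole.
Using stringification rather than a 'qr' overloading mirrors the choice
explained above: nothing is compiled inside the overloading sub itself.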
BTW, I think it is very likely that Regexp::Debugger (which,
necessarily, uses exactly the same 'qr' overloading pattern) is
susceptible to similar problems. In my view, that's a much more serious
issue: R::G is arguably a niche product, but R::D should be in every
Perl developer's toolbox.
Finally, please note that I am currently mired in final preparations for
my annual speaking tour, which starts next week. As a result, my response
time will be poor, and I will not be able to properly perlbug this issue
(i.e. boil Regexp::Grammars' 2500 lines of code down to minimal examples)
for at least a month.
I appreciate everyone's concern over this issue, and apologize for the trouble
it's causing.
Damian