
Re: Perl 5.18 and Regexp::Grammars

Dave Mitchell
July 1, 2013 16:48
On Fri, Jun 28, 2013 at 06:46:51AM +1000, Damian Conway wrote:
> Here's a more detailed description of the various issues,
> excerpted from a private discussion with rjbs:

(Please respond to this only when you have the time).


First, a bit of background about how (?{}) handling has changed in 5.18.0.

Pre 5.18, the compilation of code blocks was handled by doing an eval
of individual code blocks from within the regex engine, after the pattern
string had been fully assembled.

Apart from implementation deficiencies, like the fact that you usually got
SEGVs as soon as you tried to access outer lexical vars, it was
conceptually flawed. For example in the following:

    for my $x (qw(a b c)) {
	push @r, qr/(??{ $x })/;
    }

you might expect a pattern such as /-$r[1]-/ to be equivalent to /-b-/,
but before 5.18 it was actually equivalent to /--/.
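
As a runnable illustration of the closure behaviour described above, here
is a minimal sketch. The names $second and @matched are mine, and it
assumes perl 5.18 or later; on earlier perls the result is expected to
differ, as explained above.

```perl
use strict;
use warnings;

my @r;
for my $x (qw(a b c)) {
    # On 5.18 or later each qr// closes over this iteration's $x.
    push @r, qr/(??{ $x })/;
}

# Interpolate the second object into a larger pattern and see which
# candidate strings it accepts.
my $second  = $r[1];
my @matched = grep { /\A-$second-\z/ } qw(-a- -b- -c-);
print "matched: @matched\n";
```

Only "-b-" should be accepted, since the second qr// object captured the
value 'b'.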

The general fix was to handle code blocks directly in the parser, like
the way that something like "a$foo{expr}a" is already handled: the
perl toker converts that into something like

    ('a' . $foo{expr} . 'a')

Note that the index expression is parsed directly; in particular something
like "a$foo{'}'}a" works, and doesn't rely on balanced braces.
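
A small sketch of that point, using a made-up %foo hash: the subscript is
parsed as ordinary Perl code, so a key consisting of a lone closing brace
does not confuse the interpolation.

```perl
use strict;
use warnings;

# Hypothetical data for illustration only.
my %foo = ('}' => 'X', expr => 'Y');

my $plain  = "a$foo{expr}a";          # ordinary subscript
my $tricky = "a$foo{'}'}a";           # key is a lone closing brace
my $manual = 'a' . $foo{'}'} . 'a';   # the explicit concatenation

print "$plain $tricky $manual\n";
```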

Similarly,  /a(?{expr})b/

is now toked up as

    regcomp('a', {expr}, 'b')

so the code block is directly compiled as part of the same pass as the
surrounding code (and is part of the same sub and shares the same pad).
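
To make the "shares the same pad" point concrete, here is a minimal
sketch (the variable $seen is just for illustration): a literal code
block can read and write a lexical declared in the enclosing scope.

```perl
use strict;
use warnings;

my $seen = 'no';

# The code block is compiled along with the surrounding code, so it
# refers to the same $seen as the rest of this file.
if ("ab" =~ /a(?{ $seen = 'yes' })b/) {
    print "matched, seen=$seen\n";
}
```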

Similarly, qr/a(?{expr})b/

is now toked up (roughly speaking) as

    sub { regcomp('a', {expr}, 'b') }

so each time you execute qr//, you get a new closure.

In the presence of

    overload::constant qr => \&my_qr;

/a(?{expr})b/ gets toked as

    regcomp(my_qr('a'), {expr}, my_qr('b'))

(with the calls to my_qr() done during toking).

Note that the code block text is no longer passed to my_qr().
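
One way to observe this is with a handler that simply records what it is
given. This is a sketch only: @pieces and the pass-through handler are
hypothetical, and it assumes 5.18 or later, where per the description
above the handler should receive the literal pieces but not the code
block text.

```perl
use strict;
use warnings;
use overload ();

our @pieces;   # package variable, filled in while the pattern is toked
my $re;

{
    BEGIN {
        # Hypothetical handler: record each chunk of pattern text and
        # hand it back unchanged (a plain string, no overloading).
        overload::constant(qr => sub {
            my ($text) = @_;
            push @pieces, $text;
            return $text;
        });
    }
    $re = qr/a(?{ 1 })b/;
}

print "handler saw: ", join(' | ', @pieces), "\n";
print "still matches\n" if "ab" =~ $re;
```

The handler is installed inside a bare block so that it applies only to
the one qr// being examined.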

At the beginning of regex compilation, the list of args is concatenated
into a string, using the usual '.', '""' and 'qr' overload methods.
The slight catch to this is that the concatter has access to the original
text of each code block, plus a pointer to the op subtree. It concats the
text of the code block, but rather than asking the regex compiler to
compile the code block, it just uses the already-compiled op-tree. It
remembers the substring of the pattern which corresponds to that op tree;
so in something like /ab(?{})/ it knows that the first optree maps to
chars 2-6 of the pattern.
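
The offsets quoted above are just zero-based character positions. A
trivial sketch of the arithmetic (this illustrates the numbering only,
not how the concatter is actually implemented):

```perl
use strict;
use warnings;

my $pattern = 'ab(?{})';
my $block   = '(?{})';

# Zero-based start of the code block text, and its last character.
my $start = index($pattern, $block);
my $end   = $start + length($block) - 1;

print "code block occupies chars $start-$end\n";
```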

There are two ways this fails to work. First, in a runtime pattern like

    $c = '(?{})'; /ab$c/

the concatter fails to find an optree that corresponds to the code block
text (i.e. chars 2-6); in this case it dies unless 'use re eval'
is in scope; or if it is, it reparses the pattern string to compile any
run-time code blocks.
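
A minimal sketch of this first case (the variable names are mine): the
same run-time pattern is refused without the pragma and accepted with it.

```perl
use strict;
use warnings;

my $c = '(?{ 1 })';

# No 'use re "eval"' in scope here: the run-time code block is refused.
my $ok    = eval { my $re = qr/ab$c/; 1 };
my $error = $ok ? '' : $@;
print "without pragma: $error";

my $matched;
{
    use re 'eval';
    # With the pragma in scope the pattern string is reparsed and the
    # code block is compiled at run time.
    my $re = qr/ab$c/;
    $matched = "ab" =~ $re ? 1 : 0;
}
print "with pragma: matched=$matched\n";
```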

The second case is in the presence of overloaded concatenation; in this
case perl can no longer reliably assume that the code text maps to
pre-compiled optrees, and in this case it puts its hands up in horror,
throws away the optree and treats it as a run-time code block (and insists
on 'use re eval').

(This may at least partially explain why you've seen it randomly
requiring 'use re eval'.)

Suggested workaround

If I understand the requirements correctly, what you need is, given a user
regex within the scope of 'Regexp::Grammars', to extract out the full final
text of the regex (which may include run-time components and code blocks),
then to modify that text, including:
    * injecting new code blocks into the text;
    * modifying some variable references within user code blocks;
then compile the regex, including compiling all the code blocks within the
scope of the caller.

I think the following code demonstrates all the above working. It relies
on the fact that concatenation overload triggers disabling of compile-time
code blocks, and forces everything to run-time. So it relies on
'use re eval' being in scope at the caller.

    package Foo;

    our $lexical = 'XXX BAD XXX';

    use overload
	q{""} => sub {
			my ($pat) = @_;
			$pat = $pat->[0];

			# demo: map '<ABC>' to '(??{$lexical})ABC'
			$pat =~ s/<(\w+)>/(??{\$lexical})$1/g;

			# demo: map certain keywords in code blocks to
			# literal strings (XXX this doesn't check whether
			# they're actually within a code block)
			$pat =~ s/MATCH/"match"/g;
			$pat =~ s/ARGS/"args"/g;
			return $pat;
		    },
	q{.}  => sub {
			my ($a1, $a2) = @_;
			$a1 = $a1->[0] if ref $a1;
			$a2 = $a2->[0] if ref $a2;
			bless [ "$a1$a2" ], 'Foo';
		    };

    package main;

    BEGIN {
	overload::constant qr => sub { return bless [ $_[0] ], 'Foo' };
    }

    my $str = 'ghi-<BAR>-jkl-(??{ MATCH })-mno';
    my $lexical = "lex";

    use re 'eval';
    my $r = qr/^abc-<FOO>-def-(??{ ARGS })-$str$/;
    "abc-lexFOO-def-args-ghi-lexBAR-jkl-match-mno" =~ $r or die;
    print $r, "\n";

This outputs, on both 5.16.3 and 5.18.0:

(?^:^abc-(??{$lexical})FOO-def-(??{ "args" })-ghi-(??{$lexical})BAR-jkl-(??{ "match" })-mno$)

Note that it matches, and uses the 'lex' value of $lexical in both the
user's and Foo's code blocks.

Is this close to what you need?

I haven't considered the case where there is only one component in the
regex, so concatenation isn't triggered. This might throw a spanner in the
works.
Finally, here are some replies to your specific issues.

>         * Specifically any 'qr' overloading that returns an object that
>           stringifies to a pattern "text" that contains (?{...}) or
>           (??{...}) will now *sometimes* trigger the dreaded 'use re
>           "eval"' warning, even if there is a 'use re "eval"' in the
>           scope where the pattern was originally defined.

I can't reproduce this; I would need sample code.

>         * The second problem that has arisen in 5.18 is that variables
>           that appear in (?{...}) or (??{...}) blocks are now checked
>           for 'use strict' compliance *before* the 'qr' overloading is
>           triggered, making it impossible to provide rewritings that
>           sanitize such variables.

Yep, you can't rewrite code blocks any more, unless you can force them to
become run-time, then overload-concatenate them, as shown above.

>         * The third problem that has arisen in 5.18 is when the module
>           injects a code block that accesses an in-scope lexical
>           variable. Those blocks, when compiled, appear to
>           *sometimes* be failing to close over the correct variable.
>         * For example, the R::G <%hash> construct is rewritten into a
>           block like so:
>                 (??{
>                         exists $hash{$^N} ? q{} : q{(?!)}
>                 })
>           But, when matching, the lexical variable %hash appears to be
>           empty inside the code block, even though it is definitely not
>           empty in the enclosing lexical scope.

Again, I'd need sample code that reproduces this.

>         * The final problem that has arisen in 5.18 is that several
>           tests in R::G's suite changed from passing under 5.16 to
>           segfaulting under 5.18. That's a separate, and arguably far
>           more serious, problem in itself...and an indication that some
>           deep issue still lurks in the new mechanism. I have not yet
>           had the time to track this problem down more specifically.

This appears to be a separate issue; I reduced it to the following code
which segfaults:


It's related to '{0}' triggering a new optimisation that deletes the node
to the left of it. This is likely to be fixed for 5.18.1. The workaround
(if possible) is to avoid '{0}'. I don't know whether there will be any
remaining segfault issues after that.

In my day, we used to edit the inodes by hand.  With magnets.
