
Re: Perl 5.18 and Regexp::Grammars

Dave Mitchell
August 8, 2013 15:48
On Mon, Jul 15, 2013 at 04:57:43PM +0400, Damian Conway wrote:
> >>         * The second problem that has arisen in 5.18 is that variables
> >>           that appear in (?{...}) or (??{...}) blocks are now checked
> >>           for 'use strict' compliance *before* the 'qr' overloading is
> >>           triggered, making it impossible to provide rewritings that
> >>           sanitize such variables.
> >
> > Yep, you can't rewrite code blocks any more, unless you can force them to
> > become run-time, then overload-concatenate them, as shown above.
> Even when they are forced to become run-time (using your workaround code),
> 'use strict' compliance seems to be tested too early (i.e. before the
> qr-overloading has a chance to "vanish" the variable in question).
> For example, the following code works as expected under 5.14
> (i.e. the post-processed regex correctly matches), but under 5.18
> it generates an odd "double fatality" compile-time error:
>     Global symbol "$MAGIC_VAR" requires explicit package name at
> line 32.
>     Global symbol "$MAGIC_VAR" requires explicit package name at (eval
> 1) line 1.
> Once again, the RegexpProcessor code is identical to Dave's workaround
> code, except that this time the commented line has been added to the
> qr-overloading in order to replace $MAGIC_VAR in the source with 'foo'
> (this is a minimal version of the various kinds of much more complex
> manipulations that Regexp::Grammars actually does):
> -----cut----------cut----------cut----------cut----------cut----------cut-----
>     package RegexProcessor;
>     use overload (
>         q{""} => sub {
>                         my ($pat) = @_;
>                         return $pat->[0];
>                  },
>         q{.}  => sub {
>                         my ($a1, $a2) = @_;
>                         $a1 = $a1->[0] if ref $a1;
>                         $a2 = $a2->[0] if ref $a2;
>                         return bless [ "$a1$a2" ], 'RegexProcessor';
>                  },
>     );
>     package main;
>     use re 'eval';
>     BEGIN {
>         overload::constant qr => sub {
>             my ($regex_pattern) = @_;
>             # Replace raw $MAGIC_VAR with 'foo'...
>             # (A greatly simplified version of what Regexp::Grammars does)
>             $regex_pattern =~ s/\$MAGIC_VAR/'foo'/g;
>             return bless [ $regex_pattern ], 'RegexProcessor'
>         };
>     }
>     use strict;
>     say 'matched' if "foobar" =~ m{ (??{ $MAGIC_VAR }) bar }xms;
> -----end----------end----------end----------end----------end----------end-----

I think that this is the one that will be impossible to fully work
around; i.e. the modification of user-supplied code blocks before the
perl parser gets to see them.

Note first that moving the code-manipulation from the overload q{""}
function (as it was in my sample code) to the overload::constant qr
function (as it is in your sample code) will never work: the
overload::constant function is never called under any circumstances for
the text of literal code blocks in 5.18.x; which is why in my example
code I did the manipulation in the final stringification call (q{""}).

Before I discuss this in more detail, first can I ask whether it's
absolutely necessary for R::G to modify user code? Could the effects you
achieve be done by exporting (say) a tied var $MAGIC_VAR into the callers

Anyway, let me explain in a bit more detail what's going on.
(if this is tl;dr, then just skip to the end where I discuss alternatives)

In the presence of overload::constant qr => \&f, a general regex like
/abc(?{d})e$f/, is toked/parsed at the same time as the surrounding perl
code, into a list op that looks like

    regcomp(f('abc'), '(?{d})', {d}, f('e'), $f);

where the calls to f() are done at compile time, so if we have, say,

    sub f { uc $_[0] }

then the above actually arrives at the parser as:

    regcomp('ABC', '(?{d})', {d}, 'E', $f);

Note that the text of the code block is *not* passed through f(). Also
note that both the text of the code block and the code block itself are
passed. The text is there so that the regex compiler can assemble the
full, original text of the regex (so that print qr/(?{})/ will display the
right thing, for example). The 'bare' code is also exposed, and is parsed
and compiled along with everything else, so the {d} above is a bit like
the code block in map or grep.
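
The split described above can be observed directly. The following is a
minimal sketch, assuming perl 5.18 or later; the @seen log and the
pass-through handler are hypothetical names used only for illustration.
The handler records each string it is handed, so you can check that only
the constant pieces of the pattern arrive and that the text of the code
block does not.

```perl
use strict;
use warnings;
use overload ();

our @seen;    # hypothetical log of every string the handler receives

my $re;
{
    # overload::constant is lexically scoped, so the handler only
    # applies to patterns compiled inside this block.
    BEGIN {
        overload::constant(qr => sub {
            my ($text) = @_;
            push @seen, $text;    # record the piece, return it unchanged
            return $text;
        });
    }
    $re = qr/abc(?{ 1 })e/;
}

print "handler saw: [$_]\n" for @seen;

# Check whether any recorded piece contains the code block's text.
my $leaked = grep { index($_, '(?{') >= 0 } @seen;
print "code block text reached handler: ", ($leaked ? "yes" : "no"), "\n";
```

The handler is confined to a bare block so that no other pattern in the
file is routed through it.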

The regcomp() above will be processed at compile-time if all of its
components are compile-time (so the above without the $f, for example),
and at run-time otherwise.

In either case, the regex compiler is called with
a) a list of strings (or regex objects) like ('ABC', '(?{d})', 'E',
   whatever $f contains);
b) a list of optrees, one for each literal codeblock that got parsed
   (so {d} in the above).

The regex compiler concats the list of strings into a single string that
represents the final pattern to be compiled.
If there is just a single item in the list, then 'qr' or '""' overloading
will be called if available to convert that single item into a final
pattern (or regex object). If there are multiple items, then we start with
an empty string, then concat each item to it, first applying
qr-overloading if necessary, then calling '.' overloading if it exists,
falling back to plain concatenation (using '""' overloading on the item if
it exists). Finally after the pattern is assembled, '""' overloading is
used to retrieve its final value.
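
As a rough illustration of the assembly steps just described, here is a
sketch assuming perl 5.18 or later. The TracedPattern class and its @log
array are hypothetical; they simply record which overloads fire while the
pattern pieces are joined. Because concat overloading is involved, the code
block ends up being treated as run-time code, so 'use re "eval"' has to be
in scope.

```perl
use strict;
use warnings;

package TracedPattern;

our @log;    # hypothetical trace of which overloads were called

use overload
    q{""} => sub { push @log, 'stringify'; return $_[0][0] },
    q{.}  => sub {
        my ($x, $y, $swapped) = @_;
        # The first concat starts from an empty string on the left,
        # so the operands may arrive swapped.
        ($x, $y) = ($y, $x) if $swapped;
        push @log, 'concat';
        $x = $x->[0] if ref $x;
        $y = $y->[0] if ref $y;
        return bless ["$x$y"], __PACKAGE__;
    };

package main;

use re 'eval';    # needed: concat overloading makes the block run-time

BEGIN {
    # Wrap every constant piece of a pattern in a TracedPattern object.
    overload::constant(qr => sub { return bless [ $_[0] ], 'TracedPattern' });
}

my $matched = "abce" =~ /abc(?{ 1 })e/ ? 1 : 0;

print "matched: $matched\n";
print "overloads called: @TracedPattern::log\n";
```

The '.' handler honours the swapped flag, which the quoted example does not
need to do only because the swapped operand is always the empty string.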

During this assembly, optrees are paired up with the parts of the
final pattern string that correspond to the text of literal code blocks.
So that when the pattern string is finally passed through the regex
compiler, when it sees a '(?{', it knows to use optree #3 (say) and
attaches that tree to appropriate regex node.  If there's a '(?{' or
'(??{' in the pattern that doesn't correspond to an optree (e.g. it was
introduced by $f above, or by overloading), then the pattern is evalled,
but with any literal code-blocks blanked out.  So in the above, if $f
contained '(?{f})' and there was no funny overloading, the pattern string
would be

    ABC(?{d})E(?{f})

where the second (?{f}) doesn't have a corresponding optree. At this point
we check that 'use re eval' is in scope and if not, croak(). Otherwise, we
internally eval the string


and from the returned object, extract out the optree for the (?{f}) block,
and continue as before.
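
The 'use re eval' check is easy to see in isolation. This is a small
sketch rather than anything from R::G: $f stands in for any variable whose
contents only arrive at run time, so its code block has no optree from the
parser.

```perl
use strict;
use warnings;

my $f = '(?{ 1 })';    # a code block that only exists at run time

# Without "use re 'eval'" in scope, compiling the pattern croaks.
my $without = eval { my $re = qr/abc$f/; 1 } ? 'compiled' : "died: $@";
chomp $without;

# With the pragma in scope the same pattern compiles and matches.
my $with = do {
    use re 'eval';
    my $re = qr/abc$f/;
    "abc" =~ $re ? 'matched' : 'no match';
};

print "without re 'eval': $without\n";
print "with re 'eval':    $with\n";
```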

A similar thing happens in the presence of concat overloading; since
the final pattern string may contain the text of code-blocks that no
longer match what the parser's already seen and compiled into optrees, we
abandon any existing optrees and treat every (?{}) as a runtime code block
and recompile as above. This is why in your code example you got two
warnings/errors: the same code block was compiled twice; first as literal
code, and then, after the overloading caused that optree to be thrown
away, a second time as part of a run-time pattern.

Note that this means that even if the '""' overloading gets a chance to
rewrite the code block, the pre-modification code block will get compiled
and then discarded; and this may trigger errors such as the
    "$MAGIC_VAR" requires explicit package name
which can't be avoided.
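
To show that the strict check fires when the literal code block is
compiled, not when the pattern is run, here is a minimal sketch. The
source text, the sub name never_called and the variable $UNDECLARED are
made up for the demonstration; the pattern sits inside a sub that is never
invoked, yet compilation of the source still fails.

```perl
use strict;
use warnings;

# Source that is only compiled: never_called() is never invoked.
my $src = <<'END_SRC';
use strict;
sub never_called { return "x" =~ /(?{ $UNDECLARED })x/ }
1;
END_SRC

my $ok  = eval $src;
my $err = $@;

print $ok ? "compiled\n" : "compile error: $err";
```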

Note also that if the user's regex consists purely of a code block, with
no constant text (such as /(?{a})/ versus /(?{a})b/), then the whole
"make overload::constant qr return an overloaded object" trick fails to
work, since the text of code blocks isn't passed through the const mapper.


For those two reasons (bare code blocks don't get processed; code blocks
are compiled - and possibly error out - before they can be modified), I
don't think R::G as it stands is viable under 5.18.x.

Which leads me to repeat the question: is it possible for R::G to work
without the facility to modify the text of user code blocks?
If not, then I wonder whether, for 5.20.0, we could add a new facility
that would allow you to do what you need. I'm open to suggestions, but one
possibility might be to add a 'raw' type to overload::constant that is
passed the whole literal string, before interpolation etc (i.e. it sees it
as a single-quoted string), and before the rest of the perl toker and
parser has seen it. For example currently, this

    use overload;
    BEGIN {
        overload::constant qr => sub {
            my $s = shift; print "qr($s)\n"; $s;
        };
        overload::constant 'q' => sub {
            my $s = shift; print "q($s)\n"; $s;
        };
    }




I'm suggesting that in addition we allow you to add, say,

        overload::constant raw => sub {
            my $s = shift; print "raw($s)\n"; $s;
        };

that when used along with the two existing overloads, gives


Would this help??? Or would the fact that it passes you the names of
run-time variables rather than their values, be just as bad?

Atheism is a religion like not collecting stamps is a hobby
