
qr'$foo' and unescaped $ signs

From:
Nicholas Clark
Date:
June 29, 2021 14:30
Subject:
qr'$foo' and unescaped $ signs
Message ID:
20210629143001.GF9170@etla.org
One of the obscure features of regular expressions is that if one uses a
single quote as the delimiter, no interpolation takes place.
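A minimal sketch of that difference, assuming a perl new enough for re::regexp_pattern (5.10 or later). The variable $foo and its value "bar" are made up for illustration:

```perl
use strict;
use warnings;
use re 'regexp_pattern';

my $foo = 'bar';    # hypothetical variable, only here to show interpolation

# regexp_pattern returns (pattern, modifiers) in list context.
my ($interpolated) = regexp_pattern(qr/$foo/);   # $foo is expanded first
my ($literal)      = regexp_pattern(qr'$foo');   # no interpolation at all

print "qr/\$foo/ compiles the pattern: $interpolated\n";   # bar
print "qr'\$foo' compiles the pattern: $literal\n";        # $foo
```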

One result of this - there's a Data::Dumper bug about handling them:

https://rt.cpan.org/Public/Bug/Display.html?id=84569

You get this:

$ perl -MData::Dumper -w
print Dumper(qr'$foo');
__END__
$VAR1 = qr/(?^:$foo)/;


That is not correct (and it's still the behaviour in blead).
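One way to see why that output is a problem is to feed it back to perl. This is only a sketch: round_trips is a hypothetical helper (not part of Data::Dumper), and the sample strings are arbitrary. It evals the dumped text under strict and compares how the original and the copy match.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: dump a regex, eval the dumped text, and compare how
# the original and the copy behave on a few sample strings.
sub round_trips {
    my ($re, @samples) = @_;
    local $Data::Dumper::Terse = 1;     # print just the value, no "$VAR1 = "
    my $copy = eval(Dumper($re));       # undef if the dumped text won't compile
    return 0 unless ref($copy) eq 'Regexp';
    for my $s (@samples) {
        my $orig_hit = $s =~ $re   ? 1 : 0;
        my $copy_hit = $s =~ $copy ? 1 : 0;
        return 0 if $orig_hit != $copy_hit;
    }
    return 1;
}

my @samples = ('$foo', 'foo', "bar\nfoo");
printf "%s\n", round_trips(qr/\$foo/, @samples) ? 'qr// round-trips' : 'qr// differs';
printf "%s\n", round_trips(qr'$foo',  @samples) ? "qr'' round-trips" : "qr'' differs";
```

What the second line prints depends on the Data::Dumper version installed, so no expected output is given for it.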

However, I'm struggling - is this a core bug? Or a DD bug? How do these
crazy regexes really work?

Say I take this:

$ cat ~/test/single.pl
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
use Devel::Peek;

use re 'debug';

my $re = qr'$foo';

print "$re\n";
print Dumper($re);
++$Data::Dumper::Useperl;
print Dumper($re);

print "\$foo" =~ $re ? "Match\n" : "not\n";

Dump($re);

__END__

$ perl ~/test/single.pl
Compiling REx "$foo"
Final program:
   1: SEOL (2)
   2: EXACT <foo> (4)
   4: END (0)
anchored "foo" at 0..0 (checking anchored) minlen 3
(?^:$foo)
$VAR1 = qr/$foo/;
$VAR1 = qr/$foo/;
Matching REx "$foo" against "$foo"
Intuit: trying to determine minimum start position...
  doing 'check' fbm scan, [0..4] gave 1
  Found anchored substr "foo" at offset 1 (rx_origin now 1)...
  (multiline anchor test skipped)
  try at offset...
Intuit: Successfully guessed: match at offset 1
   1 <$> <foo>               |   0| 1:SEOL(2)
                             |   0| failed...
Match failed
not
SV = IV(0xbc3920) at 0xbc3920
  REFCNT = 1
  FLAGS = (ROK)
  RV = 0xbbe138
  SV = REGEXP(0xcb5944) at 0xbbe138
    REFCNT = 1
    FLAGS = (OBJECT,POK,FAKE,pPOK)
    PV = 0xcbf1f8 "(?^:$foo)"
...



Compare this with *trying* to re-write qr'$foo' as qr/\$foo/:

$ cat ~/test/normal.pl
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
use Devel::Peek;

use re 'debug';

my $re = qr/\$foo/;

print "$re\n";
print Dumper($re);
++$Data::Dumper::Useperl;
print Dumper($re);

print "\$foo" =~ $re ? "Match\n" : "not\n";

Dump($re);

__END__

$ perl ~/test/normal.pl
Compiling REx "\$foo"
Final program:
   1: EXACT <$foo> (3)
   3: END (0)
anchored "$foo" at 0..0 (checking anchored isall) minlen 4
(?^:\$foo)
$VAR1 = qr/\$foo/;
$VAR1 = qr/\$foo/;
Matching REx "\$foo" against "$foo"
Intuit: trying to determine minimum start position...
  doing 'check' fbm scan, [0..4] gave 0
  Found anchored substr "$foo" at offset 0 (rx_origin now 0)...
  (multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
Match
SV = IV(0x1726920) at 0x1726920
  REFCNT = 1
  FLAGS = (ROK)
  RV = 0x1721138
  SV = REGEXP(0x18189d4) at 0x1721138
    REFCNT = 1
    FLAGS = (OBJECT,POK,FAKE,pPOK)
    PV = 0x1821e88 "(?^:\\$foo)"
...




My *first* thought when looking at all the Devel::Peek output was "there is
nothing recording that the regex was written with '' - this is wrong?"

And as there's no hint about the '' in the internal state, this must be
buggy because anything trying to interpolate them is going to mistake
'$foo' for $foo and violate strict.

So I played, and nothing goes wrong.


After quite a while, my *second* thought was "hang on - those two aren't
the same regular expression" - ie qr'$foo' isn't qr/\$foo/


If you look at the regex debug output, qr'$foo' is

1) the anchor $
2) the 3 character fixed string foo


while qr/\$foo/ is

1) the 4 character fixed string $foo


at which point it's starting to make slightly more sense.
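Stripped of the debugging output, the comparison between the two test scripts above boils down to something like this:

```perl
use strict;
use warnings;

my $quoted  = qr'$foo';    # the anchor $ followed by the literal "foo"
my $escaped = qr/\$foo/;   # the literal four characters "$foo"

for my $re ($quoted, $escaped) {
    # Try both regexes against the same four-character string.
    printf "%s => %s\n", $re, ('$foo' =~ $re ? 'Match' : 'not');
}
```

As in the runs above, only the second one reports "Match".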

(the second regex differs in that it has the EXACT flag set on it, which
isn't because of the '' vs //, but because it's entirely a literal string,
so the result from the optimiser can be used directly, without hitting the
main regex engine.)



So, I think what matters here is that scalar variable interpolation happens
early in regex compilation, and only happens once. If you interpolate a
regex, the *literal* content of that regex is interpolated (however strange)
but isn't then subject to another round of variable interpolation.
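A small sketch of that "only once" behaviour, assuming re::regexp_pattern is available. The names $inner and $outer are made up, and deliberately no variable called $foo exists anywhere in the script:

```perl
use strict;
use warnings;
use re 'regexp_pattern';

my $inner = qr'$foo';          # there is no variable $foo in this script
my $outer = qr/x${inner}y/;    # $inner is interpolated here, exactly once

# The text of $inner ends up inside $outer as-is; strict never complains,
# because the "$foo" inside it is not looked at as a variable again.
my ($pattern) = regexp_pattern($outer);
print "$pattern\n";
```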

So, am I right in thinking

1) the perl core is correct in what it stores into the C structures?
2) what Devel::Peek reports is sane?
3) this all works with interpolation?
4) if I see \$ in the C structures, it came from \$ in qr// qr"" etc?

(even if the match itself makes no sense, and won't work without //m.)
(In which case, the bug *is* in Data::Dumper.)

So I guess, the question is:

If the PV stored for a regex has an *unescaped* dollar sign in it, does it
*have* to have originated from a regex written with ''?

ie for all the other pattern delimiters, which give what perlre calls
"double-quotish context", the two characters \$ in the pattern are passed on
downwards as \$ - ie the backslash doesn't just suppress interpolation in the
perl parser; the literal pair \$ is also passed down to regcomp.c
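A quick way to check that claim from Perl space, again assuming re::regexp_pattern. The three delimiters are just examples:

```perl
use strict;
use warnings;
use re 'regexp_pattern';

# Collect the stored pattern text for the same regex written with
# three different non-single-quote delimiters.
my @patterns;
for my $re (qr/\$foo/, qr{\$foo}, qr"\$foo") {
    my ($pattern) = regexp_pattern($re);
    push @patterns, $pattern;
}

print "$_\n" for @patterns;    # each line still has the backslash: \$foo
```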


meaning that if one scans the internal string of the regex (stored as the PV)
and finds an unescaped $ anywhere (other than the last character) then

1) the *only* way that could have got there was by being written as qr''
2) the *only* way to convert that back to a regex is to output qr''
   (ie there exists no escaping syntax capable of recreating it inside qr//)
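The scan described above could be sketched like this. has_unescaped_dollar is a hypothetical name, and the check is deliberately naive: it only counts backslashes, and knows nothing about character classes, /x comments or code blocks. It would also report a pattern such as qr/a$|b/, where perl does not interpolate the $| either.

```perl
use strict;
use warnings;
use re 'regexp_pattern';

# Hypothetical helper: true if the stored pattern contains a "$" that is
# preceded by an even number of backslashes (zero included) and is not the
# last character of the pattern.
sub has_unescaped_dollar {
    my ($re) = @_;
    my ($pattern) = regexp_pattern($re);
    return $pattern =~ m{(?<!\\)(?:\\\\)*\$(?!\z)} ? 1 : 0;
}

print has_unescaped_dollar(qr'$foo')  ? "yes\n" : "no\n";   # yes
print has_unescaped_dollar(qr/\$foo/) ? "yes\n" : "no\n";   # no
print has_unescaped_dollar(qr/foo$/)  ? "yes\n" : "no\n";   # no: final $
```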



I suspect that this rabbit hole goes deeper. What did I miss? What did I get
wrong?

Nicholas Clark

