[perl #26909] Peculiar access to lexical vars from code in regexes

Jamie Lokier
February 21, 2004 13:47
# New Ticket Created by  Jamie Lokier 
# Please include the string:  [perl #26909]
# in the subject line of all future correspondence about this issue. 
# <URL: >

This is a bug report for perl from,
generated with the help of perlbug 1.34 running under perl v5.8.0.

[Please enter your report here]

Perl allows regular expressions to contain code which is executed when
that position in the pattern is reached during a match.  See `man
perlre' for details.  For example:
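
A minimal sketch of such a pattern follows. The string, the pattern and the name $reached are illustrative assumptions; $reached is a package variable so that the sketch does not depend on how lexicals are captured.

```perl
use strict;
use warnings;

# $reached is a hypothetical name, declared with "our" (a package variable).
our $reached = 0;

# The (?{ ... }) block runs when the matcher gets past "hello".
"hello world" =~ /hello(?{ $reached = 1 }) world/;

print "reached: $reached\n";   # prints "reached: 1"
```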


If the code refers to lexical variables in the surrounding subroutine
scope, the code accesses the first _instance_ of those lexicals that
exists when the regex is first compiled.  Subsequent calls reference
that instance, even if the lexical has been destroyed and recreated by
virtue of the scope having been exited and re-entered.

This is very peculiar behaviour.

For example, the following subroutine returns (1, 0, 0), not the more
logical (1, 1, 1):

    sub test1() { map { my $x = 0; /(?{$x++})/; $x; } (1..3) }

This subroutine returns (1, 2, 3) as expected:

    sub test2() { my $x = 0; map { /(?{$x++})/; $x; } (1..3) }

However, when it is called a second time, it returns (0, 0, 0).
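
To see what a particular perl does with these two subroutines, a small probe along the following lines can help. It is only a sketch: the empty prototypes are dropped, @report is an illustrative name, and no output is shown because the result depends on the perl version being run.

```perl
use strict;
use warnings;

# The two subroutines from the report, without the empty prototypes.
sub test1 { map { my $x = 0; /(?{$x++})/; $x; } (1..3) }
sub test2 { my $x = 0; map { /(?{$x++})/; $x; } (1..3) }

# Call each subroutine twice and record what it returned each time.
my @report;
for my $call (1 .. 2) {
    push @report, "test1, call $call: (" . join(", ", test1()) . ")";
    push @report, "test2, call $call: (" . join(", ", test2()) . ")";
}

print "perl version $]\n";
print "$_\n" for @report;
```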

These problems occur with string-interpolated regexes too, independent
of whether or not the regex "o" flag is used (because Perl still tries
to avoid recompiling a regex if it hasn't changed between calls).

To ensure the expected variables are referenced inside the regex code,
the variables need to be at a large enough scope that they remain live
between calls to the regex.  For example, this function returns (1, 2,
3) every time it is called:

    my $test3_x;
    sub test3() { $test3_x = 0; map { /(?{$test3_x++})/; $test3_x; } (1..3) }

In general, when code inside a regex references variables, you have to
make sure those variables are globals (declared with "our") or local
to the package (declared with "my" _outside_ any "sub" definitions).
If you forget to do this, your program is likely harbouring an obscure
bug which will be difficult to track down.
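
For the "our" case, one possible shape is sketched below. It assumes a package variable with the hypothetical name $hits and a subroutine called test5 (neither is from the code above), and uses "local" so each call starts from a fresh value.

```perl
use strict;
use warnings;

our $hits;   # hypothetical name; any package variable would do

sub test5 {
    local $hits = 0;   # fresh value for this call, restored on exit
    return map { /(?{ $hits++ })/; $hits } (1 .. 3);
}

my @first  = test5();
my @second = test5();
print "(@first) (@second)\n";   # prints "(1 2 3) (1 2 3)"
```

Because the code in the regex refers to the package variable by name rather than to one instance of a lexical, the second call sees the same variable as the first.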

You can still use lexical scopes to confine the variable names to a
small region of code.  It simply has to be outside a scope which is
exited and re-entered, which usually means a subroutine scope.  For
example, this subroutine also returns (1, 2, 3) every time it is called,
and doesn't define any variables which are visible to any other code:

    { my $x; sub test4() { $x = 0; map { /(?{$x++})/; $x; } (1..3) } }

When a regex object is defined using the "qr//" operator, and it is
called unchanged from a match ("m//") or substitution ("s///"), code in
the regex will access the first instances of lexical variables at the
scope where the "qr//" appears, not where it is called.

However, if the regex object is interpolated into another pattern, code
will access the first instances of lexical variables at the point of
interpolation, that is, the scope where the enclosing pattern appears.

For example:

    my $x;		# Outer $x.
    my $re = qr/(?{$x++})/;
    {
        my $x;		# Inner $x.
        /$re/;		# Code in the regex increments Outer $x.
        /$re()/;	# Code in the regex increments Inner $x.
    }
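
A runnable variant of this example, which reports which $x was incremented, might look like the sketch below. The subject string "abc" and the name $inner_count are assumptions added for the probe. No expected output is given, since which variable is incremented depends on the perl version.

```perl
use strict;
use warnings;

my $x = 0;                    # Outer $x.
my $re = qr/(?{ $x++ })/;
my $inner_count;              # illustrative name: copy of Inner $x for reporting

{
    my $x = 0;                # Inner $x.
    local $_ = "abc";         # something for the patterns to match against
    /$re/;                    # regex object used unchanged
    /$re()/;                  # regex object interpolated into another pattern
    $inner_count = $x;
}

print "perl version $]: outer=$x inner=$inner_count\n";
```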

A consequence of this behaviour is that you can't write self-contained
parsing subroutines that look like this, because they don't work:

    sub count_cats_and_dogs($) {
        my ($cats, $dogs) = (0, 0);
        $_[0] =~ /(?:.*?\b(?:cat\b(?{$cats++})|dog\b(?{$dogs++})))*/g;
        return ($cats, $dogs);
    }

Instead, you have to write in this awkward style:

        my ($cats, $dogs);
        sub count_cats_and_dogs($) {
            ($cats, $dogs) = (0, 0);
            $_[0] =~ /(?:.*?\b(?:cat\b(?{$cats++})|dog\b(?{$dogs++})))*/g;
            return ($cats, $dogs);
        }
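
For comparison, here is a sketch that sidesteps the problem by not using code in the regex at all. This is a different technique, not a fix for the behaviour described: it counts with a plain //g loop and a hash. The %count hash and the sample string are illustrative.

```perl
use strict;
use warnings;

sub count_cats_and_dogs {
    my ($text) = @_;
    my %count = (cat => 0, dog => 0);

    # In scalar context, //g finds one match per iteration; $1 is the word found.
    $count{$1}++ while $text =~ /\b(cat|dog)\b/g;

    return @count{qw(cat dog)};
}

my ($cats, $dogs) = count_cats_and_dogs("cat dog cat; a dogma and a bobcat");
print "cats=$cats dogs=$dogs\n";   # prints "cats=2 dogs=1"
```

This keeps the counters as ordinary lexicals inside the sub, at the cost of not being usable where the counting has to happen in the middle of a larger match.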

The worst thing is that it's _very_ easy to miss bugs like that.  The
code compiles fine, and appears to work just fine until some corner
case is matched, and then it starts giving odd results that don't make
sense until you notice this odd semantic.

Clearly, the more intuitive behaviour is for code within a regex,
which is referencing lexical variables defined within a sub but
outside the regex, to access the instances of those lexicals which
exist when the regex is called each time it is called.

Perl must already be passing some kind of static-chain for access to
lexicals from evals in regexes, because they appear to be properly
thread-specific in interpreter threads - each thread does access its
own instance of the variable.  So I'd guess the static-chain is simply
not correctly prepared.

If the intuitive behaviour is too difficult to implement, or if it
doesn't make sense after all (for example, I'm not sure what behaviour
makes sense in conjunction with qr// and lexicals defined in different
sub() scopes to the one where the regex is called), then a warning
would be a very desirable addition:

A warning whenever code in a regex references a lexical that is named
inside a sub() scope would be _exceedingly_ useful.  It is almost
always a program bug.  Lexicals named outside all sub() scopes should
not induce the warning.

Also, I didn't see anything in the FAQ about this.

-- Jamie

[Please do not change anything below this line]
Site configuration information for perl v5.8.0:

Configured by bhcompile at Wed Aug 13 11:45:59 EDT 2003.

Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
    osname=linux, osvers=2.4.21-1.1931.2.382.entsmp, archname=i386-linux-thread-multi
    uname='linux str'
    config_args='-des -Doptimize=-O2 -g -pipe -march=i386 -mcpu=i686 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dinstallprefix=/usr -Dprefix=/usr -Darchname=i386-linux -Dvendorprefix=/usr -Dsiteprefix=/usr -Dotherlibdirs=/usr/lib/perl5/5.8.0 -Duseshrplib -Dusethreads -Duseithreads -Duselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=
    useperlio= d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=un uselongdouble=
    usemymalloc=, bincompat5005=undef
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -I/usr/include/gdbm'
    ccversion='', gccversion='3.2.2 20030222 (Red Hat Linux 3.2.2-5)', gccosandvers=''
    intsize=r, longsize=r, ptrsize=5, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
k', ivsize=4'
ivtype='l, nvtype='double'
o_nonbl', nvsize=, Off_t='', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
l', ldflags =' -L/u'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldb -ldl -lm -lpthread -lc -lcrypt -lutil
    libc=/lib/, so=so, useshrplib=true, libperl=libper
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic -Wl,-rpath,/usr/lib/perl5/5.8.0/i386-linux-thread-multi/CORE'
ccdlflags='-rdynamic -Wl,-rpath,/usr/lib/perl5', lddlflags='s Unicode/Normalize XS/A'

Locally applied patches:

@INC for perl v5.8.0:

Environment for perl v5.8.0:
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PERL_BADLANG (unset)
    dlflags='-share (unset)
