develooper Front page | perl.perl5.porters | Postings from October 2016

[perl #129803] Regexp syntax check when variables contained with\Q...\E

Thread Next
Ed Avis
October 4, 2016 14:29
[perl #129803] Regexp syntax check when variables contained with\Q...\E
Message ID:
# New Ticket Created by  "Ed Avis" 
# Please include the string:  [perl #129803]
# in the subject line of all future correspondence about this issue. 
# <URL: >

This is a bug report for perl from,
generated with the help of perlbug 1.40 running under perl 5.22.2.

[Please describe your issue here]

When possible regular expressions are compiled and checked at program compile time.
This program gives an error even though the sub f() is never called:

    sub f { /(/ }

But when the regexp includes variables, this early checking cannot be done.
The following code will only give an error if and when f() is called:

    $x = 'a';
    sub f { /($x/ }

In general, this has to be so.  Perl can't know what the possible
values of $x will be at run time.  If $x contains ')' then the regexp
is well-formed.

However, when building a regexp you may choose to use the \Q...\E
mechanism as a safer alternative to raw string interpolation.
Whatever appears inside \Q...\E is effectively matched as a literal
string, with regexp metacharacters like ) not having their usual
effect.  (As an implementation detail, this might work by \Q...\E
carefully escaping all such characters with backslashes before parsing
the regexp, but from the user's point of view you can consider it a
way to match literal text contained in a scalar.)

If the variable appears in the regexp protected by \Q...\E then the
syntax checks can still happen as normal.

    sub f { /(\Q$x\E/ }

No matter the value of $x at run time, this will never be a valid
regexp; it will always have an unbalanced ( at the start.  Perl could
warn for this regexp at compile time just as it warns for /(/.

One possible way to do this would be to try making a munged version of
the regexp, replacing all \Q...\E fragments of the regexp with the
dummy construct (?:).  If the resulting munged regexp does not contain
any variables, then it can be used as a compile-time check; the munged
regexp will be syntactically valid if and only if the original regexp
is syntactically valid for all possible uses.

There might be a better way to do this checking depending on
implementation details of the regexp engine.

* Further discussion

Not strictly part of the bug report, but I would like to mention a
possible avenue to improve compile-time checking of composed regexps
when \Q...\E is not used.

The other major way to build up a regexp from variables is to include
a variable which itself holds a compiled regular expression (qr/.../).
Here, too, you can detect some regexp syntax errors for certain, no
matter what other regexp the variable may hold.

    my $re = qr/whatever/;

This is always going to be an invalid regexp as long as $re holds a
compiled regular expression object.  The compiled regexp included can
never manage to close the open ( since it must itself have balanced
parentheses.  But Perl doesn't have a way to know that $re will always
hold a compiled regexp at run time.

Suppose there were a way for the programmer to specify his or her
intention that the variable will always hold a compiled regexp.  Let's
say that \I...\J is used for this (this syntax is arbitrary and only
for the sake of discussion).  So you would write

    my $re_1 = qr/hel+o/;
    my $re_2 = qr/there/;
    my $re = qr/ \I$re_1\J \s+ \I$re_2\J /x;

At the point when $re is compiled, and the values of $re_1 and $re_2
are included in the larger regexp string, Perl would check at run time
that they really were compiled regexp objects and not arbitrary

In itself, this might catch a few bugs but is perhaps not worth the
extra clunkiness.  However, it would also enable compile-time checking
of regexp syntax in the same way as \Q...\J would.

    my $re = qr/whatever/;

Here Perl can know that since $re is always going to be a compiled
regexp, the resulting regexp of /($re/ will always have unbalanced
parens.  The checking may have more subtleties than for \Q...\E since
an included compiled regexp can change the number of capturing groups,
for example, affecting the validity of later backreferences.  Perhaps
only the most basic syntax checks (like making sure parens are
balanced) could be done at program compile time, with more 'semantic'
ones (like a backreference to a nonexistent group) done when the
string interpolation has been done and the final regexp is compiled.

Finally, only for those who enjoy the bondage and discipline, an
optional 'strict mode' would forbid arbitrary string interpolation in
regexps, requiring all variable uses to be explicitly labelled as
either \Q...\E (for matching a literal string) or \I...\J (for
including a precompiled regexp fragment).  Of course, there are cases
where you want to build up a regexp from fragments which are not
themselves valid regexps, so strict mode would not be appropriate all
the time and certainly not on by default.  But it would be a useful
way to eliminate regexp-injection bugs.  (Taint mode helps too, of
course, but is a run time check only.)

[Please do not change anything below this line]
Site configuration information for perl 5.22.2:

Configured by Red Hat, Inc. at Wed Aug  3 14:06:20 UTC 2016.

Summary of my perl5 (revision 5 version 22 subversion 2) configuration:
    osname=linux, osvers=4.6.3-300.fc24.x86_64, archname=x86_64-linux-thread-multi
    uname='linux 4.6.3-300.fc24.x86_64 #1 smp fri jun 24 20:52:41 utc 2016 x86_64 x86_64 x86_64 gnulinux '
    config_args='-des -Doptimize=none -Dccflags=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches  -m64 -mtune=generic -Dldflags=-Wl,-z,relro  -Dccdlflags=-Wl,--enable-new-dtags -Wl,-z,relro  -Dlddlflags=-shared -Wl,-z,relro  -Dshrpdir=/usr/lib64 -DDEBUGGING=-g -Dversion=5.22.2 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dprefix=/usr -Dvendorprefix=/usr -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl5 -Dsitearch=/usr/local/lib64/perl5 -Dprivlib=/usr/share/perl5 -Dvendorlib=/usr/share/perl5/vendor_perl -Darchlib=/usr/lib64/perl5 -Dvendorarch=/usr/lib64/perl5/vendor_perl -Darchname=x86_64-linux-thread-multi -Dlibpth=/usr/local/lib64 /lib64 /usr/lib64 -Duseshrplib -Dusethreads -Duseithreads -Dusedtrace=/usr/bin/dtrace -Duselargefiles -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallu
 srbinperl=n -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr -Dd_gethostent_r_proto -Ud_endhostent_r_proto -Ud_sethostent_r_proto -Ud_endprotoent_r_proto -Ud_setprotoent_r_proto -Ud_endservent_r_proto -Ud_setservent_r_proto -Dscriptdir=/usr/bin -Dusesitecustomize'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -fwrapv -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='  -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -fwrapv -fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='5.3.1 20160406 (Red Hat 5.3.1-6)', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678, doublekind=3
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16, longdblkind=3
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags ='-Wl,-z,relro  -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib64 /lib64 /usr/lib64 /usr/local/lib /usr/lib /lib/../lib64 /usr/lib/../lib64 /lib
    libs=-lpthread -lresolv -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lpthread -lresolv -lnsl -ldl -lm -lcrypt -lutil -lc, so=so, useshrplib=true,
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,--enable-new-dtags -Wl,-z,relro '
    cccdlflags='-fPIC', lddlflags='-shared -Wl,-z,relro  -L/usr/local/lib -fstack-protector-strong'

Locally applied patches:
    Fedora Patch1: Removes date check, Fedora/RHEL specific
    Fedora Patch3: support for libdir64
    Fedora Patch4: use libresolv instead of libbind
    Fedora Patch5: USE_MM_LD_RUN_PATH
    Fedora Patch6: Skip hostname tests, due to builders not being network capable
    Fedora Patch7: Dont run one io test due to random builder failures
    Fedora Patch15: Define SONAME for
    Fedora Patch16: Install to -Dshrpdir value
    Fedora Patch22: Document Math::BigInt::CalcEmu requires Math::BigInt (CPAN RT#85015)
    Fedora Patch26: Make *DBM_File desctructors thread-safe (RT#61912)
    Fedora Patch27: Make PadlistNAMES() lvalue again (CPAN RT#101063)
    Fedora Patch28: Make magic vtable writable as a work-around for Coro (CPAN RT#101063)
    Fedora Patch29: Fix duplicating PerlIO::encoding when spawning threads (RT#31923)
    Fedora Patch30: Do not let XSLoader load relative paths (CVE-2016-6185)
    Fedora Patch31: Avoid loading optional modules from default . (CVE-2016-1238)
    Fedora Patch200: Link XS modules to with EU::CBuilder on Linux
    Fedora Patch201: Link XS modules to with EU::MM on Linux

@INC for perl 5.22.2:

Environment for perl 5.22.2:
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PERL_BADLANG (unset)

Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About