develooper Front page | perl.perl5.porters | Postings from June 2015

[perl #125491] Perl regular expression bug in v5.18

Thread Next
From:
Michael Schout
Date:
June 27, 2015 06:17
Subject:
[perl #125491] Perl regular expression bug in v5.18
Message ID:
rt-4.0.18-12322-1435358069-164.125491-75-0@perl.org
# New Ticket Created by  Michael Schout 
# Please include the string:  [perl #125491]
# in the subject line of all future correspondence about this issue. 
# <URL: https://rt.perl.org/Ticket/Display.html?id=125491 >


This is a bug report for perl from mschout@gkg.net,
generated with the help of perlbug 1.39 running under perl 5.18.4.


-----------------------------------------------------------------
I have found a rather innocuous regular expression that does not work in perl
v5.18, but works in v5.16 and works again in v5.20.

Here is the regular expression substitution:

    s|^(?:\s*h)?/||i

When given the string 'h//w/b', this should strip off the "h/" at the
beginning, resulting in '/w/b'.

NOTE: This is a reduced down version from the original that triggered this bug
for me:

    s|^(?:\s*https?:)?//[^/]+||i

Under perl v5.18, this does not work:

perl 5.18.4:

$ perl -E '$x = "h//w/b"; $x =~ s|^(?:\s*h)?/||i; say $x'

    h//w/b


However, this works perfectly on v5.16.3, and also on v5.20.0:

perl 5.16.3:
$ perl -E '$x = "h//w/b"; $x =~ s|^(?:\s*h)?/||i; say $x'

    /w/b

perl 5.20.0:

$ perl -E '$x = "h//w/b"; $x =~ s|^(?:\s*h)?/||i; say $x'

    /w/b

Curiously, the following changes to the regex yield the correct results:

- remove the "\s*": s|^(?:h)?/||i   -> works correctly in 5.18
- remove the "/i":  s|^(?:\s*h)?/|| -> works correctly in 5.18

A git bisect shows that this broke in the following commit on v5.17:

b25a103638bc5111446c1e65a21db00a6e13acc1 is the first bad commit
commit b25a103638bc5111446c1e65a21db00a6e13acc1
Author: Karl Williamson <public@khwilliamson.com>
Date:   Thu Dec 27 12:34:41 2012 -0700
 
    regcomp.c: Clean up ANYOF_CLASS handling.
    
    The ANYOF_CLASS flag is used in ANYOF nodes (for [bracketed] and the
    synthetic start class) only when matching something like \w, [:punct:]
    etc., under /l (locale).  It should not be set unless /l is specified.
    However, it was always getting set for the synthetic start class.  This
    commit fixes that.  The previous code was masking errors in which it was
    being tested for unnecessarily, and for much of the 5.17 series, the
    synthetic start class was always set to test for locale, which was a
    waste of cpu when no locale was specified.
 
:100644 100644 eee952f376f066c6757208aae350b70dfb9fc9e3 bf8d7e0bab1079f84d490f4cc93e6737516ef47d M	regcomp.c
:100644 100644 8eb849cc1ec789d4c488525c587dd102298236b2 be3970d0ed9e3fca1ab95be89354c7a8ca1aeb39 M	regcomp.h
:040000 040000 b7cb07f7f1266276b13171049a9fe59bef27f891 224e7347bcc874f5265c0ac313a66e9f7489d9d3 M	

And reverse bisecting for when this bug was fixed in 5.19, I found that this bug went away with the following commit:

commit cdd87c1d4df41f9a54cccff996fa64d291adcee8
Author: Karl Williamson <public@khwilliamson.com>
Date:   Sun Sep 22 21:36:29 2013 -0600
 
    Teach regex optimizer to handle above-Latin1
    
    Until this commit, the regular expression optimizer has essentially
    punted on above-Latin1 code points.  Under some circumstances, they
    would be taken into account, more or less, but often, the generated
    synthetic start class would end up matching all above-Latin1 code
    points.  With the advent of inversion lists, it becomes feasible to
    actually fully handle such code points, as inversion lists are a
    convenient way to express arbitrary lists of code points and take their
    union, intersection, etc.  This commit changes the optimizer to use
    inversion lists for operating on the code points the synthetic start
    class can match.
    
    I don't much understand the overall operation of the optimizer.  I'm
    told that previous porters found that perturbing it caused unexpected
    behaviors.  I had promised to get this change in 5.18, but didn't.  I'm
    trying to get it in early enough into the 5.20 preliminary series that
    any problems will surface before 5.20 ships.
    
    This commit doesn't change the macro level logic, but does significantly
    change various micro level things.  Thus the 'and' and 'or' subroutines
    have been rewritten to use inversion lists.  I'm pretty confident that
    they do what their names suggest.  I re-derived the equations for what
    these operations should do, getting the same results in some cases, but
    extending others where the previous code mostly punted.  The derivations
    are given in comments in the respective routines.
    
    Some of the code is greatly simplified, as it no longer has to treat
    above-Latin1 specially.
    
    It is now feasible for /i matching of above-Latin1 code points to know
    explicitly the folds that should be in the synthetic start class.  But
    more prepatory work needs to be done before putting that into place.
    ...
 
:100644 100644 ec203f9c1f3ea42c65324e632c746042c32954f1 3dd62f946eedd99f87489fecfeb1acd86e2d250b M	embed.fnc
:100644 100644 fca8736feb457d00987232649f8b563aef4f24fa 5fc9171d233fd06b4d6cc08570cdb41f976d0b0b M	embed.h
:100644 100644 568cdf733c7784cc3f5a059583d9ae9a262ae72a 33d00ef42cf01836e9365e4a9086cf64b678b74b M	proto.h
:100644 100644 ec24583f1f7bec42e186a4241353b9688bf6f2de efefd0a17ba14144286dc55c149d6ee422b18249 M	regcomp.c
:100644 100644 f0153fc12c842c26d928d3303feb519430b73a07 eccb46690a4d252fb107044116973e3baa0cbd2c M	regcomp.h

I realize fully that 5.18 is officially "End of life", but this is a rather
severe regression and 5.18 still widespread deployment at this time, so it 
would be really nice to get this fixed in maint-5.18 somehow. Given that this
is an extremely simple regular expression I suspect there may be many other
cases where regular expressions are broken in 5.18. Perhaps the patch that fixed
this in 5.19 can be backported to maint-5.18? I am not familiar at all with
what is going on in this commit to attempt that.

[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=core
    severity=high
---
Site configuration information for perl 5.18.4:

Configured by mschout at Fri Jun 26 15:13:44 CDT 2015.

Summary of my perl5 (revision 5 version 18 subversion 4) configuration:
   
  Platform:
    osname=darwin, osvers=14.3.0, archname=darwin-2level
    uname='darwin lothlorien.snowcrash.lan 14.3.0 darwin kernel version 14.3.0: mon mar 23 11:59:05 pdt 2015; root:xnu-2782.20.48~5release_x86_64 x86_64 '
    config_args='-de -Dprefix=/Users/mschout/perl5/perlbrew/perls/perl-5.18.4 -Accflags=-fPIC -Aeval:scriptdir=/Users/mschout/perl5/perlbrew/perls/perl-5.18.4/bin'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-common -DPERL_DARWIN -fPIC -fno-strict-aliasing -pipe -fstack-protector -I/opt/local/include',
    optimize='-O3',
    cppflags='-fno-common -DPERL_DARWIN -fPIC -fno-strict-aliasing -pipe -fstack-protector -I/opt/local/include'
    ccversion='', gccversion='4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='env MACOSX_DEPLOYMENT_TARGET=10.3 cc', ldflags =' -fstack-protector -L/opt/local/lib'
    libpth=/opt/local/lib /usr/lib
    libs=-lgdbm -ldbm -ldl -lm -lutil -lc
    perllibs=-ldl -lm -lutil -lc
    libc=, so=dylib, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags=' -bundle -undefined dynamic_lookup -L/opt/local/lib -fstack-protector'

Locally applied patches:
    

---
@INC for perl 5.18.4:
    /Users/mschout/perl5/perlbrew/perls/perl-5.18.4/lib/site_perl/5.18.4/darwin-2level
    /Users/mschout/perl5/perlbrew/perls/perl-5.18.4/lib/site_perl/5.18.4
    /Users/mschout/perl5/perlbrew/perls/perl-5.18.4/lib/5.18.4/darwin-2level
    /Users/mschout/perl5/perlbrew/perls/perl-5.18.4/lib/5.18.4
    .

---
Environment for perl 5.18.4:
    DYLD_LIBRARY_PATH (unset)
    HOME=/Users/mschout
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/Users/mschout/perl5/perlbrew/bin:/Users/mschout/perl5/perlbrew/perls/perl-5.18.4/bin:/opt/local/bin:/opt/local/sbin:/opt/local/bin:/opt/local/sbin:/Users/mschout/.rvm/gems/ruby-2.1.1/bin:/Users/mschout/.rvm/gems/ruby-2.1.1@global/bin:/Users/mschout/.rvm/rubies/ruby-2.1.1/bin:/opt/local/bin:/opt/local/sbin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/Users/mschout/bin:/Users/mschout/Dropbox/bin:/opt/local/libexec/gnubin:/Users/mschout/.rvm/bin:/Users/mschout/.rvm/bin
    PERLBREW_BASHRC_VERSION=0.69
    PERLBREW_HOME=/Users/mschout/.perlbrew
    PERLBREW_MANPATH=/Users/mschout/perl5/perlbrew/perls/perl-5.18.4/man
    PERLBREW_PATH=/Users/mschout/perl5/perlbrew/bin:/Users/mschout/perl5/perlbrew/perls/perl-5.18.4/bin
    PERLBREW_PERL=perl-5.18.4
    PERLBREW_ROOT=/Users/mschout/perl5/perlbrew
    PERLBREW_VERSION=0.69
    PERL_BADLANG (unset)
    PERL_LOCAL_LIB_ROOT=
    SHELL=/bin/bash


Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About