develooper Front page | perl.perl5.porters | Postings from February 2015

[perl #123743] RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?)

Thread Next
From:
perlbug-followup
Date:
February 6, 2015 19:56
Subject:
[perl #123743] RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?)
Message ID:
rt-4.0.18-10541-1423151177-1422.123743-75-0@perl.org
# New Ticket Created by   
# Please include the string:  [perl #123743]
# in the subject line of all future correspondence about this issue. 
# <URL: https://rt.perl.org/Ticket/Display.html?id=123743 >



This is a bug report for perl from warren.hyde@amd.com,
generated with the help of perlbug 1.39 running under perl 5.18.2.


-----------------------------------------------------------------
[Please describe your issue here]

Certain types of (poorly-written) regular expressions have become
painfully slow since the refactoring of regexec.c at some point in
the 5.17 versions. I believe this to be a bug in the behavior, as
the RegEx engine is doing much more (wasted) work trying to match
certain patterns that previous Perl versions (through 5.16.1) had
no problems with.

Specifically, consider this (good, previous behavior) example:

5.16.1> $ perl5.16.1 -Mre=debug -e '"00000000: 0c" =~ /.*:\s*ab/i'
5.16.1> Compiling REx ".*:\s*ab"
5.16.1> Final program:
5.16.1>    1: STAR (3)
5.16.1>    2:   REG_ANY (0)
5.16.1>    3: EXACTF <:> (5)
5.16.1>    5: STAR (7)
5.16.1>    6:   SPACE (0)
5.16.1>    7: EXACTF <ab> (9)
5.16.1>    9: END (0)
5.16.1> anchored(MBOL) implicit minlen 3
5.16.1> Matching REx ".*:\s*ab" against "00000000: 0c"
5.16.1>    0 <> <00000000: >         |  1:STAR(3)
5.16.1>                                   REG_ANY can match 12 times out of 2147483647...
5.16.1>    8 <00000000> <: 0c>       |  3:  EXACTF <:>(5)
5.16.1>    9 <00000000:> < 0c>       |  5:  STAR(7)
5.16.1>                                     SPACE can match 1 times out of 2147483647...
5.16.1>                                     failed...
5.16.1>                                   failed...
5.16.1> Match failed
5.16.1> Freeing REx: ".*:\s*ab"

(Using case-insensitive matching disallows the optimizer from looking
for the fixed text "ab".)

Now, compare that with the work done by Perl since 5.18:

5.18.2> $ perl5.18.2 -Mre=debug -e '"00000000: 0c" =~ /.*:\s*ab/i'
5.18.2> Compiling REx ".*:\s*ab"
5.18.2> Final program:
5.18.2>    1: STAR (3)
5.18.2>    2:   REG_ANY (0)
5.18.2>    3: EXACT <:> (5)
5.18.2>    5: STAR (7)
5.18.2>    6:   POSIXD[\s] (0)
5.18.2>    7: EXACTF <ab> (9)
5.18.2>    9: END (0)
5.18.2> floating ":" at 0..2147483647 (checking floating) anchored(MBOL) implicit minlen 3 
5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "00000000: 0c"
5.18.2> Found floating substr ":" at offset 8...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed: match at offset 0
5.18.2> Matching REx ".*:\s*ab" against "00000000: 0c"
5.18.2>    0 <> <00000000: >         |  1:STAR(3)
5.18.2>                                   REG_ANY can match 12 times out of 2147483647...
5.18.2>    8 <00000000> <: 0c>       |  3:  EXACT <:>(5)
5.18.2>    9 <00000000:> < 0c>       |  5:  STAR(7)
5.18.2>                                     POSIXD[\s] can match 1 times out of 2147483647...
5.18.2>                                     failed...
5.18.2>                                   failed...
5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "0000000: 0c"
5.18.2> Found floating substr ":" at offset 7...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed: match at offset 0
5.18.2>    1 <0> <0000000: 0>        |  1:STAR(3)
5.18.2>                                   REG_ANY can match 11 times out of 2147483647...
5.18.2>    8 <00000000> <: 0c>       |  3:  EXACT <:>(5)
5.18.2>    9 <00000000:> < 0c>       |  5:  STAR(7)
5.18.2>                                     POSIXD[\s] can match 1 times out of 2147483647...
5.18.2>                                     failed...
5.18.2>                                   failed...
5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "000000: 0c"
5.18.2> Found floating substr ":" at offset 6...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed: match at offset 0
5.18.2>    2 <00> <000000: 0c>       |  1:STAR(3)
5.18.2>                                   REG_ANY can match 10 times out of 2147483647...
5.18.2>    8 <00000000> <: 0c>       |  3:  EXACT <:>(5)
5.18.2>    9 <00000000:> < 0c>       |  5:  STAR(7)
5.18.2>                                     POSIXD[\s] can match 1 times out of 2147483647...
5.18.2>                                     failed...
5.18.2>                                   failed...
5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "00000: 0c"
5.18.2> Found floating substr ":" at offset 5...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed: match at offset 0
5.18.2>    3 <000> <00000: 0c>       |  1:STAR(3)
5.18.2>                                   REG_ANY can match 9 times out of 2147483647...
5.18.2>    8 <00000000> <: 0c>       |  3:  EXACT <:>(5)
5.18.2>    9 <00000000:> < 0c>       |  5:  STAR(7)
5.18.2>                                     POSIXD[\s] can match 1 times out of 2147483647...
5.18.2>                                     failed...
5.18.2>                                   failed...
5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "0000: 0c"
5.18.2> Found floating substr ":" at offset 4...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed: match at offset 0
5.18.2>    4 <0000> <0000: 0c>       |  1:STAR(3)
5.18.2>                                   REG_ANY can match 8 times out of 2147483647...
5.18.2>    8 <00000000> <: 0c>       |  3:  EXACT <:>(5)
5.18.2>    9 <00000000:> < 0c>       |  5:  STAR(7)
5.18.2>                                     POSIXD[\s] can match 1 times out of 2147483647...
5.18.2>                                     failed...
5.18.2>                                   failed...
5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "000: 0c"
5.18.2> Found floating substr ":" at offset 3...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed: match at offset 0
5.18.2>    5 <00000> <000: 0c>       |  1:STAR(3)
5.18.2>                                   REG_ANY can match 7 times out of 2147483647...
5.18.2>    8 <00000000> <: 0c>       |  3:  EXACT <:>(5)
5.18.2>    9 <00000000:> < 0c>       |  5:  STAR(7)
5.18.2>                                     POSIXD[\s] can match 1 times out of 2147483647...
5.18.2>                                     failed...
5.18.2>                                   failed...
5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "00: 0c"
5.18.2> Found floating substr ":" at offset 2...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed: match at offset 0
5.18.2>    6 <000000> <00: 0c>       |  1:STAR(3)
5.18.2>                                   REG_ANY can match 6 times out of 2147483647...
5.18.2>    8 <00000000> <: 0c>       |  3:  EXACT <:>(5)
5.18.2>    9 <00000000:> < 0c>       |  5:  STAR(7)
5.18.2>                                     POSIXD[\s] can match 1 times out of 2147483647...
5.18.2>                                     failed...
5.18.2>                                   failed...
5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "0: 0c"
5.18.2> Found floating substr ":" at offset 1...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed: match at offset 0
5.18.2>    7 <0000000> <0: 0c>       |  1:STAR(3)
5.18.2>                                   REG_ANY can match 5 times out of 2147483647...
5.18.2>    8 <00000000> <: 0c>       |  3:  EXACT <:>(5)
5.18.2>    9 <00000000:> < 0c>       |  5:  STAR(7)
5.18.2>                                     POSIXD[\s] can match 1 times out of 2147483647...
5.18.2>                                     failed...
5.18.2>                                   failed...
5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against ": 0c"
5.18.2> Found floating substr ":" at offset 0...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed: match at offset 0
5.18.2>    8 <00000000> <: 0c>       |  1:STAR(3)
5.18.2>                                   REG_ANY can match 4 times out of 2147483647...
5.18.2>    8 <00000000> <: 0c>       |  3:  EXACT <:>(5)
5.18.2>    9 <00000000:> < 0c>       |  5:  STAR(7)
5.18.2>                                     POSIXD[\s] can match 1 times out of 2147483647...
5.18.2>                                     failed...
5.18.2>                                   failed...
5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against " 0c"
5.18.2> Did not find floating substr ":"...
5.18.2> Match rejected by optimizer
5.18.2> Match failed
5.18.2> Freeing REx: ".*:\s*ab"

Why does it walk forward through the string trying to guess the start of
the match again when all prior Perls know to give up after the first
fail? This new behavior is causing massive wasted processing for a
certain class of regular expression, namely the naiive use of ".*",
which (rather unfortulately) shows up very often in the real world.

Since 5.18 (or perhaps 5.17.x), the performance of this regular expression
degrades geometrically with the length of the line before the colon,
which caused a big efficiency problem. I know it's not a good example
of a regex, but this presumes the coder knows how the internals of regex
matching works.

You can see how badly this affects things with the following comparison:

5.16.1> $ seq -f "%01000.0f: 0c" 1000 | /usr/bin/time perl5.16.1 -ne '/.*:\s*ab/i'
5.16.1> 0.00user 0.00system 0:00.01elapsed 0%CPU (0avgtext+0avgdata 7488maxresident)k
5.16.1> 0inputs+0outputs (0major+515minor)pagefaults 0swaps

5.18.2> $ seq -f "%01000.0f: 0c" 1000 | /usr/bin/time perl5.18.2 -ne '/.*:\s*ab/i'
5.18.2> 4.31user 0.00system 0:04.41elapsed 97%CPU (0avgtext+0avgdata 7296maxresident)k
5.18.2> 0inputs+0outputs (0major+503minor)pagefaults 0swaps

Spending almost 4.5 seconds matching this regular expression against 1000
lines of text is a big step back in the performance of the flagship text
processing language. Something is amiss when backtracking ".*" now.

[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=core
    severity=medium
---
Site configuration information for perl 5.18.2:

Configured by pandora at Tue May  6 14:41:42 EDT 2014.

Summary of my perl5 (revision 5 version 18 subversion 2) configuration:
   
  Platform:
    osname=linux, osvers=2.6.18-308.8.2.el5, archname=x86_64-linux-thread-multi
    uname='linux atlvbuild-el5-64 2.6.18-308.8.2.el5 #1 smp tue may 29 11:54:17 edt 2012 x86_64 x86_64 x86_64 gnulinux '
    config_args='-des -Dcc=gcc -Dccflags=-fPIC -Dcflags=-fPIC -Uinstallusrbinperl -Dperladmin=pandora-admin@mpdtxmail -Dprefix=/tool/pandora64/.package/perl-5.18.2-gcc481 -Dstartperl=#!/tool/pandora64/.package/perl-5.18.2-gcc481/bin/perl -Dlibpth=/tool/pandora64/lib /lib64 /usr/lib64 -Duseshrplib -Duseithreads -Dusethreads -Duselargefiles -Duse64bitint -Dd_semctl_semun -Di_db -Ui_ndbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Ubincompat5005 -Uversiononly -Dpager=/tool/pandora64/bin/less -isr'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fPIC -fno-strict-aliasing -pipe -fstack-protector -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fPIC -fno-strict-aliasing -pipe -fstack-protector'
    ccversion='', gccversion='4.8.1', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=undef, longdblsize=
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -fstack-protector'
    libpth=/tool/pandora64/lib /lib64 /usr/lib64
    libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=/lib/libc-2.5.so, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.5'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/tool/pandora64/.package/perl-5.18.2-gcc481/lib/5.18.2/x86_64-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -fstack-protector'

Locally applied patches:
    

---
@INC for perl 5.18.2:
    /proj/verif_release_ro/p4w/current
    /proj/verif_release_ro/p4_mkwa/current
    /proj/verif_release_ro/siteconfig/current/lib/perl
    /proj/verif_release_ro/error_reporting/25/lib/perl
    /proj/verif_release_ro/cbwa_bootcore/current/lib/perl
    /tool/pandora64/.package/perl-5.18.2-gcc481/lib/site_perl/5.18.2/x86_64-linux-thread-multi
    /tool/pandora64/.package/perl-5.18.2-gcc481/lib/site_perl/5.18.2
    /tool/pandora64/.package/perl-5.18.2-gcc481/lib/5.18.2/x86_64-linux-thread-multi
    /tool/pandora64/.package/perl-5.18.2-gcc481/lib/5.18.2
    .

---
Environment for perl 5.18.2:
    HOME=/home/whyde
    LANG=C
    LANGUAGE (unset)
    LC_COLLATE=C
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/tool/pandora64/.package/perl-5.18.2-gcc481/bin:/tool/pandora64/.package/perl-5.18.1/bin:/proj/verif_release_ro/p4w/current:/proj/verif_release_ro/p4_mkwa/current:/tool/pandora64/.package/subversion-1.6.3/bin:/proj/verif_release_ro/svn_ssh/31/bin:/tool/pandora64/.package/tcltk-8.4.6-a/bin:/tool/pandora64/.package/perl-5.8.8/bin:/proj/verif_release_ro/siteconfig/current/bin:/tool/pandora64/bin:/tool/pandora/bin:/proj/verif_release_ro/cbwa_bootcore/current/bin:/bin:/usr/bin:/usr/lib64/qt-3.3/bin:/usr/X11R6/bin:/usr/kerberos/bin:/sbin:/usr/sbin:/tool/lsf/bin:/home/whyde/bin:/usr/NX/bin:/usr/NX/bin:/proj/soc_infra/bin
    PERL5LIB=/proj/verif_release_ro/p4w/current:/proj/verif_release_ro/p4_mkwa/current:/proj/verif_release_ro/siteconfig/current/lib/perl:/proj/verif_release_ro/error_reporting/25/lib/perl:/proj/verif_release_ro/cbwa_bootcore/current/lib/perl
    PERL_BADLANG (unset)
    PERL_HOME=/tool/pandora64/.package/perl-5.18.2-gcc481
    SHELL=/tool/pandora/binlatest/tcsh


Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About