Front page | perl.perl5.porters |
Postings from February 2015
[perl #123743] RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?)
Thread Next
From:
perlbug-followup
Date:
February 6, 2015 19:56
Subject:
[perl #123743] RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?)
Message ID:
rt-4.0.18-10541-1423151177-1422.123743-75-0@perl.org
# New Ticket Created by
# Please include the string: [perl #123743]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt.perl.org/Ticket/Display.html?id=123743 >
This is a bug report for perl from warren.hyde@amd.com,
generated with the help of perlbug 1.39 running under perl 5.18.2.
-----------------------------------------------------------------
[Please describe your issue here]
Certain types of (poorly-written) regular expressions have become
painfully slow since the refactoring of regexec.c at some point in
the 5.17 versions. I believe this to be a bug in the behavior, as
the RegEx engine is doing much more (wasted) work trying to match
certain patterns that previous Perl versions (through 5.16.1) had
no problems with.
Specifically, consider this (good, previous behavior) example:
5.16.1> $ perl5.16.1 -Mre=debug -e '"00000000: 0c" =~ /.*:\s*ab/i'
5.16.1> Compiling REx ".*:\s*ab"
5.16.1> Final program:
5.16.1> 1: STAR (3)
5.16.1> 2: REG_ANY (0)
5.16.1> 3: EXACTF <:> (5)
5.16.1> 5: STAR (7)
5.16.1> 6: SPACE (0)
5.16.1> 7: EXACTF <ab> (9)
5.16.1> 9: END (0)
5.16.1> anchored(MBOL) implicit minlen 3
5.16.1> Matching REx ".*:\s*ab" against "00000000: 0c"
5.16.1> 0 <> <00000000: > | 1:STAR(3)
5.16.1> REG_ANY can match 12 times out of 2147483647...
5.16.1> 8 <00000000> <: 0c> | 3: EXACTF <:>(5)
5.16.1> 9 <00000000:> < 0c> | 5: STAR(7)
5.16.1> SPACE can match 1 times out of 2147483647...
5.16.1> failed...
5.16.1> failed...
5.16.1> Match failed
5.16.1> Freeing REx: ".*:\s*ab"
(Using case-insensitive matching disallows the optimizer from looking
for the fixed text "ab".)
Now, compare that with the work done by Perl since 5.18:
5.18.2> $ perl5.18.2 -Mre=debug -e '"00000000: 0c" =~ /.*:\s*ab/i'
5.18.2> Compiling REx ".*:\s*ab"
5.18.2> Final program:
5.18.2> 1: STAR (3)
5.18.2> 2: REG_ANY (0)
5.18.2> 3: EXACT <:> (5)
5.18.2> 5: STAR (7)
5.18.2> 6: POSIXD[\s] (0)
5.18.2> 7: EXACTF <ab> (9)
5.18.2> 9: END (0)
5.18.2> floating ":" at 0..2147483647 (checking floating) anchored(MBOL) implicit minlen 3
5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "00000000: 0c"
5.18.2> Found floating substr ":" at offset 8...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed: match at offset 0
5.18.2> Matching REx ".*:\s*ab" against "00000000: 0c"
5.18.2> 0 <> <00000000: > | 1:STAR(3)
5.18.2> REG_ANY can match 12 times out of 2147483647...
5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5)
5.18.2> 9 <00000000:> < 0c> | 5: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "0000000: 0c"
5.18.2> Found floating substr ":" at offset 7...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed: match at offset 0
5.18.2> 1 <0> <0000000: 0> | 1:STAR(3)
5.18.2> REG_ANY can match 11 times out of 2147483647...
5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5)
5.18.2> 9 <00000000:> < 0c> | 5: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "000000: 0c"
5.18.2> Found floating substr ":" at offset 6...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed: match at offset 0
5.18.2> 2 <00> <000000: 0c> | 1:STAR(3)
5.18.2> REG_ANY can match 10 times out of 2147483647...
5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5)
5.18.2> 9 <00000000:> < 0c> | 5: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "00000: 0c"
5.18.2> Found floating substr ":" at offset 5...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed: match at offset 0
5.18.2> 3 <000> <00000: 0c> | 1:STAR(3)
5.18.2> REG_ANY can match 9 times out of 2147483647...
5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5)
5.18.2> 9 <00000000:> < 0c> | 5: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "0000: 0c"
5.18.2> Found floating substr ":" at offset 4...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed: match at offset 0
5.18.2> 4 <0000> <0000: 0c> | 1:STAR(3)
5.18.2> REG_ANY can match 8 times out of 2147483647...
5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5)
5.18.2> 9 <00000000:> < 0c> | 5: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "000: 0c"
5.18.2> Found floating substr ":" at offset 3...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed: match at offset 0
5.18.2> 5 <00000> <000: 0c> | 1:STAR(3)
5.18.2> REG_ANY can match 7 times out of 2147483647...
5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5)
5.18.2> 9 <00000000:> < 0c> | 5: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "00: 0c"
5.18.2> Found floating substr ":" at offset 2...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed: match at offset 0
5.18.2> 6 <000000> <00: 0c> | 1:STAR(3)
5.18.2> REG_ANY can match 6 times out of 2147483647...
5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5)
5.18.2> 9 <00000000:> < 0c> | 5: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "0: 0c"
5.18.2> Found floating substr ":" at offset 1...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed: match at offset 0
5.18.2> 7 <0000000> <0: 0c> | 1:STAR(3)
5.18.2> REG_ANY can match 5 times out of 2147483647...
5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5)
5.18.2> 9 <00000000:> < 0c> | 5: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against ": 0c"
5.18.2> Found floating substr ":" at offset 0...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed: match at offset 0
5.18.2> 8 <00000000> <: 0c> | 1:STAR(3)
5.18.2> REG_ANY can match 4 times out of 2147483647...
5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5)
5.18.2> 9 <00000000:> < 0c> | 5: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against " 0c"
5.18.2> Did not find floating substr ":"...
5.18.2> Match rejected by optimizer
5.18.2> Match failed
5.18.2> Freeing REx: ".*:\s*ab"
Why does it walk forward through the string trying to guess the start of
the match again when all prior Perls know to give up after the first
fail? This new behavior is causing massive wasted processing for a
certain class of regular expression, namely the naiive use of ".*",
which (rather unfortulately) shows up very often in the real world.
Since 5.18 (or perhaps 5.17.x), the performance of this regular expression
degrades geometrically with the length of the line before the colon,
which caused a big efficiency problem. I know it's not a good example
of a regex, but this presumes the coder knows how the internals of regex
matching works.
You can see how badly this affects things with the following comparison:
5.16.1> $ seq -f "%01000.0f: 0c" 1000 | /usr/bin/time perl5.16.1 -ne '/.*:\s*ab/i'
5.16.1> 0.00user 0.00system 0:00.01elapsed 0%CPU (0avgtext+0avgdata 7488maxresident)k
5.16.1> 0inputs+0outputs (0major+515minor)pagefaults 0swaps
5.18.2> $ seq -f "%01000.0f: 0c" 1000 | /usr/bin/time perl5.18.2 -ne '/.*:\s*ab/i'
5.18.2> 4.31user 0.00system 0:04.41elapsed 97%CPU (0avgtext+0avgdata 7296maxresident)k
5.18.2> 0inputs+0outputs (0major+503minor)pagefaults 0swaps
Spending almost 4.5 seconds matching this regular expression against 1000
lines of text is a big step back in the performance of the flagship text
processing language. Something is amiss when backtracking ".*" now.
[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
category=core
severity=medium
---
Site configuration information for perl 5.18.2:
Configured by pandora at Tue May 6 14:41:42 EDT 2014.
Summary of my perl5 (revision 5 version 18 subversion 2) configuration:
Platform:
osname=linux, osvers=2.6.18-308.8.2.el5, archname=x86_64-linux-thread-multi
uname='linux atlvbuild-el5-64 2.6.18-308.8.2.el5 #1 smp tue may 29 11:54:17 edt 2012 x86_64 x86_64 x86_64 gnulinux '
config_args='-des -Dcc=gcc -Dccflags=-fPIC -Dcflags=-fPIC -Uinstallusrbinperl -Dperladmin=pandora-admin@mpdtxmail -Dprefix=/tool/pandora64/.package/perl-5.18.2-gcc481 -Dstartperl=#!/tool/pandora64/.package/perl-5.18.2-gcc481/bin/perl -Dlibpth=/tool/pandora64/lib /lib64 /usr/lib64 -Duseshrplib -Duseithreads -Dusethreads -Duselargefiles -Duse64bitint -Dd_semctl_semun -Di_db -Ui_ndbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Ubincompat5005 -Uversiononly -Dpager=/tool/pandora64/bin/less -isr'
hint=recommended, useposix=true, d_sigaction=define
useithreads=define, usemultiplicity=define
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=define, use64bitall=define, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fPIC -fno-strict-aliasing -pipe -fstack-protector -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
optimize='-O2',
cppflags='-D_REENTRANT -D_GNU_SOURCE -fPIC -fno-strict-aliasing -pipe -fstack-protector'
ccversion='', gccversion='4.8.1', gccosandvers=''
intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
d_longlong=define, longlongsize=8, d_longdbl=undef, longdblsize=
ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=8, prototype=define
Linker and Libraries:
ld='gcc', ldflags =' -fstack-protector'
libpth=/tool/pandora64/lib /lib64 /usr/lib64
libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
libc=/lib/libc-2.5.so, so=so, useshrplib=true, libperl=libperl.so
gnulibc_version='2.5'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/tool/pandora64/.package/perl-5.18.2-gcc481/lib/5.18.2/x86_64-linux-thread-multi/CORE'
cccdlflags='-fPIC', lddlflags='-shared -O2 -fstack-protector'
Locally applied patches:
---
@INC for perl 5.18.2:
/proj/verif_release_ro/p4w/current
/proj/verif_release_ro/p4_mkwa/current
/proj/verif_release_ro/siteconfig/current/lib/perl
/proj/verif_release_ro/error_reporting/25/lib/perl
/proj/verif_release_ro/cbwa_bootcore/current/lib/perl
/tool/pandora64/.package/perl-5.18.2-gcc481/lib/site_perl/5.18.2/x86_64-linux-thread-multi
/tool/pandora64/.package/perl-5.18.2-gcc481/lib/site_perl/5.18.2
/tool/pandora64/.package/perl-5.18.2-gcc481/lib/5.18.2/x86_64-linux-thread-multi
/tool/pandora64/.package/perl-5.18.2-gcc481/lib/5.18.2
.
---
Environment for perl 5.18.2:
HOME=/home/whyde
LANG=C
LANGUAGE (unset)
LC_COLLATE=C
LD_LIBRARY_PATH (unset)
LOGDIR (unset)
PATH=/tool/pandora64/.package/perl-5.18.2-gcc481/bin:/tool/pandora64/.package/perl-5.18.1/bin:/proj/verif_release_ro/p4w/current:/proj/verif_release_ro/p4_mkwa/current:/tool/pandora64/.package/subversion-1.6.3/bin:/proj/verif_release_ro/svn_ssh/31/bin:/tool/pandora64/.package/tcltk-8.4.6-a/bin:/tool/pandora64/.package/perl-5.8.8/bin:/proj/verif_release_ro/siteconfig/current/bin:/tool/pandora64/bin:/tool/pandora/bin:/proj/verif_release_ro/cbwa_bootcore/current/bin:/bin:/usr/bin:/usr/lib64/qt-3.3/bin:/usr/X11R6/bin:/usr/kerberos/bin:/sbin:/usr/sbin:/tool/lsf/bin:/home/whyde/bin:/usr/NX/bin:/usr/NX/bin:/proj/soc_infra/bin
PERL5LIB=/proj/verif_release_ro/p4w/current:/proj/verif_release_ro/p4_mkwa/current:/proj/verif_release_ro/siteconfig/current/lib/perl:/proj/verif_release_ro/error_reporting/25/lib/perl:/proj/verif_release_ro/cbwa_bootcore/current/lib/perl
PERL_BADLANG (unset)
PERL_HOME=/tool/pandora64/.package/perl-5.18.2-gcc481
SHELL=/tool/pandora/binlatest/tcsh
Thread Next
-
[perl #123743] RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?)
by perlbug-followup