Front page | perl.perl5.porters |
Postings from February 2009
regexp iteration limits
Thread Next
From:
Zefram
Date:
February 9, 2009 16:17
Subject:
regexp iteration limits
Message ID:
20090210001739.GC2263@fysh.org
I've been looking at a class of bugs in the regexp engine, wherein the
"*" and "+" operators act like "{0,32767}" and "{1,32767}" respectively.
I've found three different cases of it, that occur in different versions
of perl. Back in October I reported bug #60034, originally with a test
case that amounted to this:
$ for pver in 5.{6.2,8.{0,8,9},9.4,10.0}; do echo $pver $(~/usr/perl/util/perlver $pver-i32-f52 perl -lwe '$SIG{__WARN__}=sub{print"(warn) "}; $a="xyzt"x10000; print $a =~ /\A(?>[a-z])*\z/ ? "ok" : "bug"'); done
5.6.2 ok
5.8.0 ok
5.8.8 ok
5.8.9 ok
5.9.4 bug
5.10.0 bug
$
I then discovered and announced this variation on it:
$ for pver in 5.{8.{0,8,9},9.4,10.0}; do echo $pver $(~/usr/perl/util/perlver $pver-i32-f52 perl -lwe '$SIG{__WARN__}=sub{print"(warn) "}; $a="xyzt"x10000; utf8::upgrade($a); print $a =~ /\A(?>[a-z])*\z/ ? "ok" : "bug"'); done
5.8.0 bug
5.8.8 bug
5.8.9 ok
5.9.4 bug
5.10.0 bug
$
As you see, a fix for this case went into 5.8.9. Did they get fixed
in 5.10? I haven't noticed anyone announce a code patch for 5.10 or
blead, but the bug has been closed in RT.
Anyway, I'm raising the issue again because I've found the same problem
happening in a different way:
$ for pver in 5.{6.2,8.{0,8,9},9.4,10.0}; do echo $pver $(~/usr/perl/util/perlver $pver-i32-f52 perl -lwe '$SIG{__WARN__}=sub{print"(warn) "}; $a="xyzt"x10000; print $a =~ /\A(?:X?[a-z])*\z/ ? "ok" : "bug"'); done
5.6.2 (warn) bug
5.8.0 (warn) bug
5.8.8 (warn) bug
5.8.9 (warn) bug
5.9.4 (warn) bug
5.10.0 (warn) bug
$
The warning that this case emits is "Complex regular subexpression
recursion limit (32766) exceeded". With exactly the right number of
repeats it's possible to get both the warning and the right answer.
This failure is particularly annoying because this regexp, like the
ones I'm developing where I originally ran into this, doesn't need any
backtracking capability -- it can never have a successful match after
a backtrack -- so there's no need for the execution to recurse at all.
I can explicitly tell perl that I don't want backtracking, by writing the
pattern as /\A(?>(?:X?[a-z])*)\z/ or (in 5.10) as /\A(?:X?[a-z])*+\z/,
but it doesn't pick up on these hints, and produces the same behaviour.
Bizarrely, an actual honest-to-Eris *recursive* regexp doesn't encounter
any such limit, nor generate the recursion warning:
$ for pver in 5.{6.2,8.{0,8,9},9.4,10.0}; do echo $pver $(~/usr/perl/util/perlver $pver-i32-f52 perl -lwe '$SIG{__WARN__}=sub{print"(warn) "}; $n=40000; $a=("<"x$n).(">"x$n); $bal=qr/(?:<X?(??{$main::bal})Y?>)?/; print $a =~ /\A$bal\z/o ? "ok" : "bug"'); done
5.6.2 ok
5.8.0 ok
5.8.8 ok
5.8.9 ok
5.9.4 ok
5.10.0 ok
$
(Damn, I thought I could simply document that there's no workaround
for the iteration limit, but now I'm thinking about workarounds using
recursive structures. Perhaps I can still claim that there's no *sane*
workaround.)
The recursion warning suggests that that form of this bug is an ingrained
aspect of the current regexp engine's structure. But it seems a pretty
devastating bug, causing such fundamental misbehaviour in a core feature,
so I think it's important to fix. I'm heading towards documenting that
perl just can't cope with inputs over 32 Ki characters.
Secondarily, perl ought to pay attention to (?>...), to avoid generating
backtracking records that it can never use.
Anyone want to tell me this is impossible?
-zefram
Thread Next
-
regexp iteration limits
by Zefram