develooper Front page | perl.perl5.porters | Postings from April 2011

Re: [perl #76546] regex engine slowdown bug

Thread Previous | Thread Next
April 1, 2011 01:59
Re: [perl #76546] regex engine slowdown bug
Message ID:
On 31 March 2011 12:05, Alex Efros via RT <> wrote:
> Чтв. Мар. 31 01:16:36 2011, demerphq писал:
>> We never fixed the superlinear cache bug that was broken some time
>> around when the engine was derecursivized.
>> Id bet a dollar this is related.
> BTW, your YAPH signature remind me about -Mre=debug, and I tried to
> compare these regexps this way. Results are interesting, but, sadly, I
> don't know perl internals good enough to understand them:
> $ perl -Mre=debug -e "/<X[^>]*>(.*?)(?=<a)/gis"
> and
> $ perl -Mre=debug -e "/<X[^>]*(?i-:>)(.*?)(?=<a)/gis"
> produce same output and work fast, but this one differs and slow:
> $ perl -Mre=debug -e "/<X[^>]*(?-i:>)(.*?)(?=<a)/gis"
> Last line in first two (fast) regex was:
>  stclass EXACTF <<X> minlen 3
> but in last (slow) regex it was:
>  floating ">" at 2..2147483647 (checking floating) stclass EXACTF <<X>
> minlen 3
> Right now we've huge amount of parsers working 24x7 with thousands of
> different regexps, and it's SLOW. Can anyone give some recommendations
> about how to work around this issue - like "add /i flag to every regex
> whenever possible", etc.? Because for now I'm even doesn't sure adding
> this redundant /i doesn't slowdown one regexp while speeding up another...

You are encountering an optimisation that normally speeds things up by
allowing perl to fail fast.

This is called the "longest constant string" optimization. During the
optimization phase we scan the pattern and calculate the position, and
value of the longest constant strings which are at a constant and
floating offset.

So in your "slow" pattern that would be for constant offset '<X' and
for floating offset ' >'.

For various reasons we prefer the floating string for this check.

My guess is that the performance impact here is that your data is long
has a LOT of '>' in it, which would then result in perl scanning to
find the _last_ (iirc). Which is probably not the right thing.

Now for various reasons this optimisation is disabled with a case
insensitive match. Which means that perl no longer sees the ">" and no
longer considers it a viable optimisation target, and therfore goes
with a different strategy which is to look for the "<X" and then match
from there. Your data probably makes this a much better strategy.

Its hard to say what to do here. *Most* times this optimization is
highly useful. In your case clearly not.

There certainly is not a generic solution here. For instance enabling
/i might make a few of your patterns faster, but likely will slow all
the rest down.

You will have to do analysis.

Also some of your patterns are potentially super-linear. Which we used
to catch at the perl level.

However if done right you will find things like (?>....) will change
them to being linear.

For various reasons Perl's engine is not a true regular expression
engine in the formal sense, and as such you sometimes have to manually
 provide it hints so it doesnt backtrack excessively.

> WBR, Alex.
> P.S. I hope one lucky day we'll find libre2 support in perl, using some
> pragma to switch between current perl regexp engine and libre2 - they
> are mostly compatible, but libre2 guarantee constant execution time and
> I'm really tired of regexps which at some moment decide to spend few
> minutes for execution instead of nanoseconds.

While i havent checked IMO libre2 cannot make that commitment and
support the full syntax of perl.

Also, I do not believe that a proper NFA/DFA can provide left-most
longest semantics.

However, we have a regex plug in interface in Perl, and have done so
since Perl 5.10

Perhaps your company should hire somebody to implement a regex plug in
that uses libre2.


perl -Mre=debug -e "/just|another|perl|hacker/"

Thread Previous | Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About