develooper Front page | perl.perl5.porters | Postings from April 2011

Re: [perl #76546] regex engine slowdown bug

Thread Previous | Thread Next
From:
Reini Urban
Date:
April 6, 2011 05:31
Subject:
Re: [perl #76546] regex engine slowdown bug
Message ID:
BANLkTi=Z+k9z0FxuAdyFq-SZYrN0xZ1ZHA@mail.gmail.com
2011/4/1 demerphq <demerphq@gmail.com>:
> On 31 March 2011 12:05, Alex Efros via RT <perlbug-followup@perl.org> wrote:
>> Чтв. Мар. 31 01:16:36 2011, demerphq писал:
>>> We never fixed the superlinear cache bug that was broken some time
>>> around when the engine was derecursivized.
>>>
>>> Id bet a dollar this is related.
>>
>> BTW, your YAPH signature remind me about -Mre=debug, and I tried to
>> compare these regexps this way. Results are interesting, but, sadly, I
>> don't know perl internals good enough to understand them:
>>
>> $ perl -Mre=debug -e "/<X[^>]*>(.*?)(?=<a)/gis"
>> and
>> $ perl -Mre=debug -e "/<X[^>]*(?i-:>)(.*?)(?=<a)/gis"
>> produce same output and work fast, but this one differs and slow:
>> $ perl -Mre=debug -e "/<X[^>]*(?-i:>)(.*?)(?=<a)/gis"
>>
>> Last line in first two (fast) regex was:
>>  stclass EXACTF <<X> minlen 3
>> but in last (slow) regex it was:
>>  floating ">" at 2..2147483647 (checking floating) stclass EXACTF <<X>
>> minlen 3
>>
>> Right now we've huge amount of parsers working 24x7 with thousands of
>> different regexps, and it's SLOW. Can anyone give some recommendations
>> about how to work around this issue - like "add /i flag to every regex
>> whenever possible", etc.? Because for now I'm even doesn't sure adding
>> this redundant /i doesn't slowdown one regexp while speeding up another...
>
> You are encountering an optimisation that normally speeds things up by
> allowing perl to fail fast.
>
> This is called the "longest constant string" optimization. During the
> optimization phase we scan the pattern and calculate the position, and
> value of the longest constant strings which are at a constant and
> floating offset.
>
> So in your "slow" pattern that would be for constant offset '<X' and
> for floating offset ' >'.
>
> For various reasons we prefer the floating string for this check.
>
> My guess is that the performance impact here is that your data is long
> has a LOT of '>' in it, which would then result in perl scanning to
> find the _last_ (iirc). Which is probably not the right thing.
>
> Now for various reasons this optimisation is disabled with a case
> insensitive match. Which means that perl no longer sees the ">" and no
> longer considers it a viable optimisation target, and therfore goes
> with a different strategy which is to look for the "<X" and then match
> from there. Your data probably makes this a much better strategy.
>
> Its hard to say what to do here. *Most* times this optimization is
> highly useful. In your case clearly not.
>
> There certainly is not a generic solution here. For instance enabling
> /i might make a few of your patterns faster, but likely will slow all
> the rest down.

Just in case someone else wants to add /i to make regex search faster:
DO NOT add /i in the general case.

In the general case and esp. for short scripts adding /i makes regex
searches slower. It has to demand-load the whole unicore/To/Fold case tables
(that's 1.6MB on 32bit), and searches for all unicode cases also.

utf8_heavy.pl is demand-loaded, also Carp is loaded.
Carp loads warnings which in turn loads all warnings::register_categories
which is about 68kB on 32-bit.
-- 
Reini Urban
http://phpwiki.org/           http://murbreak.at/

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About