develooper Front page | perl.perl5.porters | Postings from April 2011

Re: [perl #76546] regex engine slowdown bug

Thread Previous | Thread Next
From:
Karl Williamson
Date:
April 9, 2011 12:21
Subject:
Re: [perl #76546] regex engine slowdown bug
Message ID:
4DA0B1A5.8070003@khwilliamson.com
On 04/06/2011 06:31 AM, Reini Urban wrote:
> 2011/4/1 demerphq<demerphq@gmail.com>:
>> On 31 March 2011 12:05, Alex Efros via RT<perlbug-followup@perl.org>  wrote:
>>> Чтв. Мар. 31 01:16:36 2011, demerphq писал:
>>>> We never fixed the superlinear cache bug that was broken some time
>>>> around when the engine was derecursivized.
>>>>
>>>> Id bet a dollar this is related.
>>>
>>> BTW, your YAPH signature remind me about -Mre=debug, and I tried to
>>> compare these regexps this way. Results are interesting, but, sadly, I
>>> don't know perl internals good enough to understand them:
>>>
>>> $ perl -Mre=debug -e "/<X[^>]*>(.*?)(?=<a)/gis"
>>> and
>>> $ perl -Mre=debug -e "/<X[^>]*(?i-:>)(.*?)(?=<a)/gis"
>>> produce same output and work fast, but this one differs and slow:
>>> $ perl -Mre=debug -e "/<X[^>]*(?-i:>)(.*?)(?=<a)/gis"
>>>
>>> Last line in first two (fast) regex was:
>>>   stclass EXACTF<<X>  minlen 3
>>> but in last (slow) regex it was:
>>>   floating ">" at 2..2147483647 (checking floating) stclass EXACTF<<X>
>>> minlen 3
>>>
>>> Right now we've huge amount of parsers working 24x7 with thousands of
>>> different regexps, and it's SLOW. Can anyone give some recommendations
>>> about how to work around this issue - like "add /i flag to every regex
>>> whenever possible", etc.? Because for now I'm even doesn't sure adding
>>> this redundant /i doesn't slowdown one regexp while speeding up another...
>>
>> You are encountering an optimisation that normally speeds things up by
>> allowing perl to fail fast.
>>
>> This is called the "longest constant string" optimization. During the
>> optimization phase we scan the pattern and calculate the position, and
>> value of the longest constant strings which are at a constant and
>> floating offset.
>>
>> So in your "slow" pattern that would be for constant offset '<X' and
>> for floating offset '>'.
>>
>> For various reasons we prefer the floating string for this check.
>>
>> My guess is that the performance impact here is that your data is long
>> has a LOT of '>' in it, which would then result in perl scanning to
>> find the _last_ (iirc). Which is probably not the right thing.
>>
>> Now for various reasons this optimisation is disabled with a case
>> insensitive match. Which means that perl no longer sees the ">" and no
>> longer considers it a viable optimisation target, and therfore goes
>> with a different strategy which is to look for the "<X" and then match
>> from there. Your data probably makes this a much better strategy.
>>
>> Its hard to say what to do here. *Most* times this optimization is
>> highly useful. In your case clearly not.
>>
>> There certainly is not a generic solution here. For instance enabling
>> /i might make a few of your patterns faster, but likely will slow all
>> the rest down.
>
> Just in case someone else wants to add /i to make regex search faster:
> DO NOT add /i in the general case.
>
> In the general case and esp. for short scripts adding /i makes regex
> searches slower. It has to demand-load the whole unicore/To/Fold case tables
> (that's 1.6MB on 32bit), and searches for all unicode cases also.
>
> utf8_heavy.pl is demand-loaded, also Carp is loaded.
> Carp loads warnings which in turn loads all warnings::register_categories
> which is about 68kB on 32-bit.

I believe this is fixed now in blead.

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About