develooper Front page | perl.perl5.porters | Postings from January 2013

[perl #107816] Performance regression since 0abd0d78

Thread Next
From:
James E Keenan via RT
Date:
January 28, 2013 01:22
Subject:
[perl #107816] Performance regression since 0abd0d78
Message ID:
rt-3.6.HEAD-27190-1359336133-1358.107816-15-0@perl.org
On Tue Jan 17 09:08:29 2012, demerphq wrote:
> On 17 January 2012 08:36, Simon Kirby <sim@netnation.com> wrote:
> > On Tue, Jan 10, 2012 at 01:27:55AM -0800, yves orton via RT wrote:
> >
> >> > If I rebuild Perl with this commit reverted, performance improves
> by
> >> > about 20%, but not what it was before the upgrade. (I will
> probably have
> >> > to do more bisecting.)
> >> > I've simplified this particular test-case to:
> >> >
> >> > time perl -e '$c="y"x10000000;$c=~/a|b/i'
> >> >
> >> > This takes about .04 seconds on Perl 5.10.0, and 1.44 seconds on
> Perl
> >> > 5.10.1 (with that backported commit). Newer Perl 5 versions such
> as
> >> > 5.14.2 are the same.
> >>
> >> This patch blocked the TRIE optimisation from running on non-
> unicode
> >> patterns under /i. The reason is that the expected behavior is
> >> dependent on the utf8'ness of the string and the TRIE optimisation
> >> does not handle this properly. (Much to my embarrassment.) So this
> >> patch is required for correct behaviour. IOW, we are trading speed
> for
> >> correctness here. ;-) So what this patch does is restore the match
> >> behavior to that of 5.8.x.
> >>
> >> $ perl -Mre=debug -le'/a|b/'
> >> Compiling REx "a|b"
> >> Final program:
> >> � �1: TRIE-EXACT[ab] (7)
> >> � � � <a>
> >> � � � <b>
> >> � �7: END (0)
> >> stclass AHOCORASICK-EXACT[ab] minlen 1
> >> Freeing REx: "a|b"
> >> $ perl -Mre=debug -le'/a|b/i'
> >> Compiling REx "a|b"
> >> Final program:
> >> � �1: BRANCH (4)
> >> � �2: � EXACTF <a> (7)
> >> � �4: BRANCH (FAIL)
> >> � �5: � EXACTF <b> (7)
> >> � �7: END (0)
> >> minlen 1
> >> Freeing REx: "a|b"
> >>
> >> Having said that I am wondering if your test-case is really
> >> representative? Does the alternation actually consist of only
> single
> >> characters?
> >
> > Well, no, it is likely not this simple, and I probably
> oversimplified the
> > test-case. I actually started with a longer testcase based a script
> I
> > found before (see nntp URL in my original post):
> >
> > #!/usr/bin/perl
> >
> > use strict;
> > use warnings;
> >
> > my $content = 'y' x 1000000;
> >
> > my @seq = ( 'agggtaaa|tttaccct',
> > � � � �'[cgt]gggtaaa|tttaccc[acg]',
> > � � � �'a[act]ggtaaa|tttacc[agt]t',
> > � � � �'ag[act]gtaaa|tttac[agt]ct',
> > � � � �'agg[act]taaa|ttta[agt]cct',
> > � � � �'aggg[acg]aaa|ttt[cgt]ccct',
> > � � � �'agggt[cgt]aa|tt[acg]accct',
> > � � � �'agggta[cgt]a|t[acg]taccct',
> > � � � �'agggtaa[cgt]|[acg]ttaccct' );
> >
> > my @cnt = (0) x @seq;
> > for my $k (0..$#seq) {
> > � � � �++$cnt[$k] while $content=~/$seq[$k]/gi;
> > }
> >
> > ...and this is what I used for bisection to this commit. When I
> revert
> > it, I _do_ see an obvious performance improvement, but performance
> is not
> > completely restored to what it was with MailScanner on Debian lenny
> with
> > Perl 5.10.0. There are likely others factors at play, since I had to
> > upgrade MailScanner before it worked on Perl 5.10.1 due to some
> changes
> > in the tainting rules. I may try to run the new MailScanner on the
> older
> > Perl for comparison purposes, or try to run it in a profiler and
> start
> > from the top.
> >
> > Could you suggest a profiling suite that would work well in this
> case?
> 
> I will have to look into this further another time. But I will say
> 
> a) this is absolutely a pattern that would be affected by this
> optimisation.
> b) try lowercasing your source data and losing the /i and you should
> get the performance back
> c) you could probably recode this so it does one match. Something like
> this:
> 
> my @seq = ( 'agggtaaa|tttaccct',
>  � � � �'[cgt]gggtaaa|tttaccc[acg]',
>  � � � �'a[act]ggtaaa|tttacc[agt]t',
>  � � � �'ag[act]gtaaa|tttac[agt]ct',
>  � � � �'agg[act]taaa|ttta[agt]cct',
>  � � � �'aggg[acg]aaa|ttt[cgt]ccct',
>  � � � �'agggt[cgt]aa|tt[acg]accct',
>  � � � �'agggta[cgt]a|t[acg]taccct',
>  � � � �'agggtaa[cgt]|[acg]ttaccct' );
> 
>  my @cnt = (0) x @seq;
>  $content= lc $content;
>  my @pattern;
>  foreach my $seq_idx (0..$#pattern) {
>      my $seq= $pattern[$seq_idx];
>      my @parts= split /\|/, $seq;
>      my $mark= "(*MARK: $seq_idx)";
>      push @pattern, "$_$mark" for @parts;
>  }
>  my $qr= join "|", @pattern;
>  while ($content=~/$qr/g)
>  � � � �++$cnt[$REGMARK];
>  }
> 
> This will however be slightly different in behaviour, as it wont find
> matches of different $seq that overlap.
> 
> More possibly later.
> Yves
> 

Yves, list:  Are there still issues we need to discuss in this ticket? 
If not, can we close it?

Thank you very much.
Jim Keenan

---
via perlbug:  queue: perl5 status: open
https://rt.perl.org:443/rt3/Ticket/Display.html?id=107816

Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About