
Re: unicode regex performance (was Re: Future Perl development)

From: Juerd Waalboer
Date: February 8, 2007 04:04
Subject: Re: unicode regex performance (was Re: Future Perl development)
Message ID: 20070208120413.GT25362@c4.convolution.nl

demerphq wrote 2007-02-08 12:19 (+0100):
> But it still doesn't change the fact that the engine would be massively
> slower under utf8 than under a sane encoding like utf32.

That makes a lot of sense. My point was that relative to /other
engines/, it's fast.

> It has variable length characters (..) Also UTF8 has the property that
> there is no valid utf8 sequence that is itself a subsequence of a
> valid utf8 sequence.

Because of the latter, the former is not a big problem. That is, if your
application allows you to be naive and just match bytes instead of
characters.

In my "plan", you'd consider each byte a character.
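
A minimal sketch of what "each byte a character" means for a fixed-string search. The sample text and variable names are made up for illustration; the only assumption is that both the subject and the needle are UTF-8 encoded byte strings.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample data, encoded to UTF-8 bytes explicitly.
my $text   = encode('UTF-8', "caf\x{e9} costs 3 \x{20ac} today");
my $needle = encode('UTF-8', "\x{20ac}");    # three bytes: E2 82 AC

# index() on byte strings compares bytes only; nothing is decoded.
my $pos = index($text, $needle);

# This is a byte offset, not a character offset: the two-byte
# e-acute earlier in the string shifts it by one.
print "found at byte offset $pos\n";
```

Because no encoded character can start in the middle of another one, a hit found this way always begins on a character boundary.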

> Also your plan wouldn't work with case insensitive matching, would it?

Correct. But I did specifically mention that it is an incorrect
solution. Here, correctness is traded in for performance. When trying to
find all € signs in a 300 MB string, and replacing them with ASCII E's,
ignoring that you're doing UTF8 helps a lot. 

Instead of looking for \x{20ac}, you'd be looking for \xe2\x82\xac.
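
As a rough sketch of that replacement on a byte string (the input here is a short made-up stand-in for the 300 MB string, and it is assumed to be available as UTF-8 encoded bytes):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical input; stands in for the large string mentioned above.
my $chars = "10 \x{20ac}, 20 \x{20ac}";
my $bytes = encode('UTF-8', $chars);

# Match the three UTF-8 bytes of U+20AC instead of the character.
(my $out = $bytes) =~ s/\xe2\x82\xac/E/g;

# The result is still valid UTF-8, so decoding it afterwards is safe.
my $result = decode('UTF-8', $out);
print "$result\n";
```

This only works because the pattern is a fixed byte sequence. Anything that needs character semantics, such as /i or a character class over non-ASCII characters, would give wrong answers here.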

> >Perhaps it would be nice to have a pragma that does just this (upgrade
> >all regex subjects and patterns, and ignore the UTF8 flag in pattern
> >matches), for those who want or need extreme performance.
> They wouldn't see a performance increase, they would see a noticeable
> performance decrease. In some cases dramatic ones.

If you still want character semantics, indeed. But when you're dealing
with fixed strings (no character classes), your code could benefit from
it. My initial (silly, but sufficient) benchmark on some very simple
s/%([A-Z]+)%/$vars{$1}/g templating regex, with largeish real world
data, indicates an overall win of approx 25%. And that's with
utf8::upgrade() and Encode::_utf8_off and Encode::_utf8_on in Perl
space, so it could be even better.

And I'm leaving the hack in, because this application can use the extra
performance. The string that is s///'ed, and the %vars are all Unicode
data, but this particular regex doesn't have to care. All tests still
pass. 
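
For the curious, the hack looks roughly like the sketch below. The template, the contents of %vars and the to_bytes helper are made up for illustration and are not the application's real code. The sketch also assumes the replacement values get the same treatment as the subject, since mixing flagged and unflagged strings would upgrade the result and double-encode it.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical template and variables, both Unicode character data.
my $template = "Price: %PRICE% \x{20ac}, caf\x{e9}: %NAME%";
my %vars     = (PRICE => "12", NAME => "Ren\x{e9}e \x{263a}");

# Hypothetical helper: force the internal UTF-8 representation, then
# drop the flag so Perl treats the buffer as plain bytes.
sub to_bytes {
    my ($s) = @_;
    utf8::upgrade($s);
    Encode::_utf8_off($s);
    return $s;
}

my $subject = to_bytes($template);
my %raw     = map { $_ => to_bytes($vars{$_}) } keys %vars;

# Byte-semantics match. '%' and [A-Z] are ASCII, so the pattern can
# never match inside a multi-byte character.
$subject =~ s/%([A-Z]+)%/$raw{$1}/g;

# Switch the flag back on: the buffer is valid UTF-8 again.
Encode::_utf8_on($subject);

my $expected = "Price: 12 \x{20ac}, caf\x{e9}: Ren\x{e9}e \x{263a}";
print $subject eq $expected ? "ok\n" : "mismatch\n";
```

Encode::_utf8_on and Encode::_utf8_off are documented as internal functions, so this is not something to copy without thinking about it.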

By the way - I retract that a pragma would be good. I should have
proposed a regex flag. Unfortunately, /i (ignorant mode) is taken, so
perhaps /d (dumb) or /n (naive).
-- 
Kind regards,

  juerd waalboer:  perl hacker  <juerd@juerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <sales@convolution.nl>

I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.


