perl.perl5.porters | Postings from February 2007

Re: unicode regex performance (was Re: Future Perl development)

February 8, 2007 04:19
On 2/8/07, Juerd Waalboer wrote:
> demerphq skribis 2007-02-08 12:19 (+0100):
> > But it still doesn't change the fact that the engine would be massively
> > slower under utf8 than under a sane encoding like utf32.
> That makes a lot of sense. My point was that relative to /other
> engines/, it's fast.

That may be; I don't have any benchmarks to look at. And since my use
of Unicode is essentially restricted to making my regexp engine
patches support it properly, I'll take your word for it.

> > It has variable length characters (..) Also UTF8 has the property that
> > the encoding of one character never appears as a subsequence of the
> > encoding of another character.
> Because of the latter, the former is not a big problem. That is, if your
> application allows you to be naive and just match bytes instead of
> characters.
> In my "plan", you'd consider each byte a character.

Oh, sorry, maybe I should have ordered my reply differently. I said
that only because you said

"This will, however, hurt a lot when your pattern happens to be
capable of matching in the middle of a UTF8 sequence."

which is something you don't need to worry about if you have converted
both the pattern and the string to a normalized utf8 representation.

And of course, you are right: when doing a case sensitive match on
properly utf8 normalized input and pattern, you can use byte semantics
(and all the optimisations for latin_1, especially FBM matching) to do
a much more efficient match.
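To make the "middle of a UTF8 sequence" problem concrete, here is a minimal sketch. The sample string and the variable names are mine, not from the thread. It searches the UTF-8 bytes of a string once with a raw latin-1 byte as the needle and once with the needle also encoded as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative data (an assumption, not from the thread).
my $string  = "price: \x{20ac}5";   # EURO SIGN, UTF-8 bytes e2 82 ac
my $pattern = "\x{ac}";             # NOT SIGN, a single byte ac as latin-1

my $string_bytes = encode('UTF-8', $string);

# Needle left as a raw latin-1 byte: it hits the last byte of the
# euro sign's UTF-8 sequence, which is a false match.
my $naive = index($string_bytes, $pattern);

# Needle encoded as UTF-8 too (c2 ac): no false match, because c2
# never appears inside another character's sequence.
my $pattern_bytes = encode('UTF-8', $pattern);
my $normalized    = index($string_bytes, $pattern_bytes);

print "naive: $naive, normalized: $normalized\n";
```

Once both sides are in the same UTF-8 form, a byte-wise exact search can only succeed on character boundaries, which is what makes the byte-semantics shortcut safe for case sensitive exact matches.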

> > Also your plan wouldn't work with case insensitive matching, would it?
> Correct. But I did specifically mention that it is an incorrect
> solution. Here, correctness is traded in for performance. When trying to
> find all € signs in a 300 MB string, and replacing them with ASCII E's,
> ignoring that you're doing UTF8 helps a lot.
> Instead of looking for \x{20ac}, you'd be looking for \xe2\x82\xac.

Right, right, I think I misunderstood your point.
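For reference, the euro sign replacement could look roughly like this when done on bytes. This is only a sketch: it uses the public Encode encode/decode functions rather than anything from Juerd's actual code, and the sample text is made up.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Illustrative character string (an assumption, not from the thread).
my $chars = "10 \x{20ac}, 20 \x{20ac}, caf\x{e9}";

# Work on the UTF-8 bytes and match the three bytes of U+20AC literally.
my $bytes = encode('UTF-8', $chars);
my $count = ($bytes =~ s/\xe2\x82\xac/E/g);

# Back to characters; the untouched e-acute must survive the round trip.
my $result = decode('UTF-8', $bytes);

print "replaced $count euro signs\n";
```

The pattern here is a fixed byte string, so the engine never has to decode variable length characters while scanning.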

> > >Perhaps it would be nice to have a pragma that does just this (upgrade
> > >all regex subjects and patterns, and ignore the UTF8 flag in pattern
> > >matches), for those who want or need extreme performance.
> > They wouldn't see a performance increase, they would see a noticeable
> > performance decrease. In some cases dramatic ones.
> If you still want character semantics, indeed. But when you're dealing
> with fixed strings (no character classes), your code could benefit from
> it. My initial (silly, but sufficient) benchmark on some very simple
> s/%([A-Z]+)%/$vars{$1}/g templating regex, with largeish real world
> data, indicates an overall win of approx 25%. And that's with
> utf8::upgrade() and Encode::_utf8_off and Encode::_utf8_on in Perl
> space, so it could be even better.
> And I'm leaving the hack in, because this application can use the extra
> performance. The string that is s///'ed, and the %vars are all Unicode
> data, but this particular regex doesn't have to care. All tests still
> pass.

I retract my comment somewhat, I thought you were talking about
something different from what you are clearly saying now.

I can see the rationale, and yes, as long as you don't need any
matching with true character semantics this would be a good way to do
utf8 exact matching efficiently.
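A rough sketch of the hack as I understand it, using the three functions Juerd names. The helper subs, the %vars contents and the template text are hypothetical. I am also assuming the replacement values get the same treatment as the subject string, since otherwise the substitution would upgrade the bytes again and double encode them.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return a copy holding the UTF-8 bytes of $s.
sub as_utf8_bytes {
    my ($s) = @_;
    utf8::upgrade($s);        # make sure the internal form is UTF-8
    Encode::_utf8_off($s);    # then pretend it is just bytes
    return $s;
}

# Hypothetical helper: flip the flag back on so Perl sees characters again.
sub as_characters {
    my ($s) = @_;
    Encode::_utf8_on($s);
    return $s;
}

# Illustrative data (an assumption, not from the thread).
my %vars     = (NAME => "Zo\x{eb}", PRICE => "5 \x{20ac}");
my $template = "Hello %NAME%, that costs %PRICE%.";

my %byte_vars = map { ($_ => as_utf8_bytes($vars{$_})) } keys %vars;

my $bytes = as_utf8_bytes($template);
$bytes =~ s/%([A-Z]+)%/$byte_vars{$1}/g;   # runs with byte semantics
my $result = as_characters($bytes);

print length($result), " characters\n";
```

This only stays correct because the pattern is pure ASCII and everything spliced in is already valid UTF-8; Encode::_utf8_on does no validation at all.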

> By the way - I retract that a pragma would be good. I should have
> proposed a regex flag. Unfortunately, /i (ignorant mode) is taken, so
> perhaps /d (dumb) or /n (naive).

Eeek. New modifiers are trouble.


I'd prefer to see this happen automatically when the pattern is utf8
and exact and doesn't use charclasses or anything else that needs true
character semantics.
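In user space the idea could be approximated like this. To be clear, this is not how the regex engine does or would do it. It is just a sketch with a hypothetical contains() helper that takes the byte path for exact case sensitive needles and falls back to a real regex when character semantics are needed.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: does the literal $needle occur in $haystack?
sub contains {
    my ($haystack, $needle, $case_insensitive) = @_;

    if (!$case_insensitive) {
        # Exact match: compare UTF-8 bytes, no character decoding needed.
        my $pos = index(encode('UTF-8', $haystack), encode('UTF-8', $needle));
        return $pos >= 0 ? 1 : 0;
    }

    # Case folding needs true character semantics, so use the regex engine.
    return $haystack =~ /\Q$needle\E/i ? 1 : 0;
}

print contains("5 \x{20ac}", "\x{20ac}"), "\n";
```

The decision is made per call from the kind of match requested, so the caller never has to spell out a new modifier.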


perl -Mre=debug -e "/just|another|perl|hacker/"