develooper Front page | perl.perl5.porters | Postings from February 2007

Re: unicode regex performance

From: Dr.Ruud
Date: February 8, 2007 00:08
Subject: Re: unicode regex performance
Message ID: 20070208080751.24029.qmail@lists.develooper.com

Gerard Goossen wrote:

> But for example, when doing a regex for a
> fixed string like m/aap/ in UTF-8, you just have to do a memory search
> for the bytes representing 'aap'. In UTF-32 you can do the same, but
> you have much more memory to search through.

A string encoded so that each unit matches the word size of the platform
will actually search faster. Getting at the individual bytes is an extra
step, which on most hardware takes considerable time.
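
To make the trade-off concrete, here is a minimal Python sketch (not Perl's
regex engine, and not a benchmark) that does the plain memory search for
'aap' in both encodings. The sample text and the helper name byte_search are
made up for illustration. It only shows how much memory gets searched and
where the match lands; it says nothing by itself about which search is
faster on a given machine.

```python
# Illustrative sketch only: byte-level search for a fixed string in two encodings.
# The sample text and the helper name are assumptions, not taken from the thread.
text = "een aap, twee apen, drie aapjes"
needle = "aap"


def byte_search(text, needle, encoding, unit):
    """Encode both strings, then search the raw bytes.

    Returns (bytes_in_haystack, byte_offset_of_first_match).
    `unit` is the code-unit size in bytes (1 for UTF-8, 4 for UTF-32).
    """
    haystack = text.encode(encoding)
    pattern = needle.encode(encoding)
    pos = haystack.find(pattern)
    # A raw byte search could in principle match in the middle of a
    # 4-byte unit, so keep only matches aligned to a unit boundary.
    while pos != -1 and pos % unit != 0:
        pos = haystack.find(pattern, pos + 1)
    return len(haystack), pos


utf8_size, utf8_pos = byte_search(text, needle, "utf-8", 1)
utf32_size, utf32_pos = byte_search(text, needle, "utf-32-le", 4)

print("UTF-8 :", utf8_size, "bytes, match at byte", utf8_pos)
print("UTF-32:", utf32_size, "bytes, match at byte", utf32_pos,
      "= code point", utf32_pos // 4)
```

For this ASCII-only sample the UTF-32 haystack is four times as large, but
every byte offset maps to a code point index by a plain division by 4.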


> Whether UTF-8 or UTF-16 is faster depends on the content of your
> strings. If you mostly have ASCII data, UTF-8 would be shorter and thus
> faster; if you are dealing with non-Western languages, the UTF-16
> encoding will probably be shorter and thus faster.

What is important is that UTF-32 is equivalent to UCS-4 restricted to
0..10FFFF (hex), and is therefore *fixed width*.
http://unicode.org/reports/tr19/tr19-9.html
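
As a rough illustration of both points (encoded length depends on content,
and UTF-32 is fixed width), here is a small Python sketch. The sample strings
and the helper names encoded_sizes and code_point_at are made up for this
example, and the little-endian variants are used only to keep a byte order
mark out of the byte counts.

```python
# Illustrative sketch only: sample strings and helper names are assumptions.
samples = {
    "ascii": "regular expression",
    "japanese": "正規表現",
}


def encoded_sizes(s):
    """Return the encoded length in bytes for each encoding form."""
    return {enc: len(s.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


def code_point_at(utf32le_bytes, n):
    """Fetch the n-th code point from UTF-32LE data.

    Because every code point takes exactly 4 bytes, this is a direct
    slice; no scanning from the start of the string is needed.
    """
    return utf32le_bytes[4 * n:4 * n + 4].decode("utf-32-le")


for name, s in samples.items():
    print(name, len(s), "code points:", encoded_sizes(s))

utf32 = samples["japanese"].encode("utf-32-le")
print(code_point_at(utf32, 2))
```

With these samples, UTF-8 is the shortest form for the ASCII string and
UTF-16 is the shortest for the Japanese one, while UTF-32 is always exactly
4 bytes per code point.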

-- 
Affijn, Ruud

"Gewoon is een tijger."



