develooper Front page | perl.perl5.porters | Postings from February 2007

Re: unicode regex performance

From:
mark
Date:
February 10, 2007 06:49
Subject:
Re: unicode regex performance
Message ID:
20070210144929.GA24120@mark.mielke.cc
On Sat, Feb 10, 2007 at 03:25:38PM +0100, Gerard Goossen wrote:
> Nonsence. benchmark** of the most basic search:
> NONE*:  1.79s
> UTF-8:  1.86s	
> UTF-16: 2.10s
> UTF-32: 2.90s
> 
> Which is worse then lineair in memory usage (probably due to cache
> misses).
> And this is without optimalizations, like if you are searching for a
> longer patterns, you can do match multiple codepoints using one integer
> comparison.

You didn't mention what platform you are using - Intel vs AMD vs SPARC
make a difference. Also, the code you provided was not tuned towards
different uses, and you didn't provide the compiler or optimization
level used.

On my AMD64, I find that *within the same cache line*, byte accesses
are slower than word accesses. In fact, almost any number of reads
that can be fully or nearly fully parallelized at the instruction
level, take approximately the same time to complete, as long as only
one cache is touched. It's very intriguing when considering short
strings that are <64 bytes in length, which I would suggest, make up
the most common strings in existence.

Yes, the cost of reads will get more expensive if longer strings of
32-bit words are used, as compared to 8-bit words, if 4x as many cache
lines need to be loaded. I don't believe this is the common case.

In any case - UTF-16 is cheaper to process than UTF-8 in terms of
machine instructions, and for languages other than English, UTF-16
is at least efficient as UTF-8 in terms of memory usage. UTF-8 shines
the most for languages that are ASCII. Although, in your case, you've
changed the rules and allowed UTF-EBCDIC to shine if your native
encoding is EBCDIC. You cheated.

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

                           http://mark.mielke.cc/




nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About