On Thu, Feb 08, 2007 at 09:06:31AM +0100, Dr.Ruud wrote:
> Gerard Goossen schreef:
>
> > But for example when doing a regex for a fixed string like m/aap/ in
> > UTF-8 you just have to do a memory search for the bytes representing
> > 'aap'; in UTF-32 you can do the same, but you have much more memory
> > to search through.
>
> A string encoded such that it uses the word size of the platform will
> actually search faster. It is an extra step to get to the bytes, which
> on most hardware takes considerable time.

Nonsense. A benchmark** of the most basic search:

    NONE*:  1.79s
    UTF-8:  1.86s
    UTF-16: 2.10s
    UTF-32: 2.90s

which is worse than linear in memory usage (probably due to cache
misses). And this is without optimizations: if you are searching for a
longer pattern, you can match multiple codepoints using one integer
comparison.

Gerard Goossen

*: NONE is without doing the memory lookup, so it measures only the
'for' and the 'if' (so doing some loop unrolling would greatly improve
performance here).

**Benchmark: This is the most stupid search I could think of, but it
demonstrates the impact of the size of the string, and that comparing
bytes instead of ints doesn't matter. Adjust '#define test_t long' and
Encode::encode("UTF-32-LE", ...) to measure the other encodings.

    XS(XS_search)
    {
        dVAR; dXSARGS;
        SV * const sv = ST(0);
        STRLEN len;
    #define test_t long
        test_t *s = (test_t*) SvPV(sv, len);
        test_t char1 = 0x41;
        test_t *send = (test_t*) (((char*)s) + len);
        for (; s < send; s++) {
            if (*s == char1) {
                XSRETURN(0);
            }
        }
        XSRETURN(0);
    }

    #!perl
    use strict;
    use warnings;
    use utf8;
    use Benchmark "timethese";
    use Encode;

    my $n = 200_000;
    my $count = 200_000_000 / $n;
    my $a = ("abcefgh" x $n) . "\x{20ac}\x{41}\x{43}\x{44}";
    my $a_utf32 = Encode::encode("UTF-32-LE", $a);

    timethese($count, {
        utf32 => sub { search($a_utf32) },
    });