On Sat, Feb 10, 2007 at 03:25:38PM +0100, Gerard Goossen wrote: > Nonsence. benchmark** of the most basic search: > NONE*: 1.79s > UTF-8: 1.86s > UTF-16: 2.10s > UTF-32: 2.90s > > Which is worse then lineair in memory usage (probably due to cache > misses). > And this is without optimalizations, like if you are searching for a > longer patterns, you can do match multiple codepoints using one integer > comparison. You didn't mention what platform you are using - Intel vs AMD vs SPARC make a difference. Also, the code you provided was not tuned towards different uses, and you didn't provide the compiler or optimization level used. On my AMD64, I find that *within the same cache line*, byte accesses are slower than word accesses. In fact, almost any number of reads that can be fully or nearly fully parallelized at the instruction level, take approximately the same time to complete, as long as only one cache is touched. It's very intriguing when considering short strings that are <64 bytes in length, which I would suggest, make up the most common strings in existence. Yes, the cost of reads will get more expensive if longer strings of 32-bit words are used, as compared to 8-bit words, if 4x as many cache lines need to be loaded. I don't believe this is the common case. In any case - UTF-16 is cheaper to process than UTF-8 in terms of machine instructions, and for languages other than English, UTF-16 is at least efficient as UTF-8 in terms of memory usage. UTF-8 shines the most for languages that are ASCII. Although, in your case, you've changed the rules and allowed UTF-EBCDIC to shine if your native encoding is EBCDIC. You cheated. Cheers, mark -- mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/