On Wed, Feb 07, 2007 at 03:32:44PM -0800, Jan Dubois wrote:
> >I mean heck, utf8 was a kludge worked out on a napkin to make it
> >possible to store unicode filenames in a unix style filesystem. (utf8
> >has the property that no encoding of a high codepoint contains any
> >special character used by a unix filesystem) WTF would we use a kludge
> >as our primary internal representation when there are better
> >representations to use? Especially when you consider the performance
> >impact of doing so (use unicode and watch the regex engine get much
> >sloooooweeeeeerrrrrrr.)
> This is probably the main reason some big enterprise users stick with
> Perl 5.6.1. I've seen several companies approach ActiveState, desperate
> to get help in moving to 5.8 while maintaining their application
> performance. Unfortunately there is not much you can do to help them
> beyond the "avoid using Unicode strings, and downgrade every time a
> module returns stuff in Unicode" advice.

I don't understand this. Computers are much faster than they ever were
before. I don't understand how a company would be 'desperate' to stick
with Perl 5.6.1 because Perl 5.8 is slower at some task. 18 months
later? Problem solved. It's come to the point again where I've reverted
to "write the code that is simple and maintainable instead of
efficient, as you won't notice the difference without benchmark testing
anyway."

Then there is the subject of people assuming that the world is ASCII.
UTF-8 is only more efficient than UTF-16 at storage for ASCII
characters. For non-ASCII characters, UTF-16 is equal or more efficient
in terms of storage, and much more efficient in performance.

While popular processors are almost all 64-bit now, code is still doing
per-byte comparisons. I've done timings before and found that my
processor (AMD64) can deal with 16-bit and 8-bit quantities at
approximately the same speed, even though 16-bit data uses more cache
lines.
This makes me conclude that 8-bit is actually *slower* on modern
architectures in terms of processing requirements. It's a different
world.

Perl tries to play both sides with its 8-bit/UTF-8 strings. The result
is confusion. Perl should have done a better job of making the encoding
transparent. Should have. Could have. Oh well.

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com
Neighbourhood Coder
Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring
  them all and in the darkness bind them...

                           http://mark.mielke.cc/
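
The UTF-8 versus UTF-16 storage comparison in the message above can be
checked with a short sketch. This is not from the original thread: the
choice of Python, the helper name encoded_sizes, and the sample strings
are all assumptions made for illustration. It only measures encoded
size and says nothing about processing speed.

```python
# Compare the encoded size of the same text in UTF-8 and UTF-16.
# Helper name and sample strings are illustrative assumptions.

def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no BOM counted."""
    utf8_len = len(text.encode("utf-8"))
    # "utf-16-le" avoids the 2-byte byte-order mark that "utf-16" prepends.
    utf16_len = len(text.encode("utf-16-le"))
    return utf8_len, utf16_len

samples = [
    ("ASCII", "perl"),
    ("Greek", "\u03b1\u03b2\u03b3\u03b4"),   # U+0080..U+07FF range
    ("Japanese", "\u65e5\u672c\u8a9e"),      # U+0800..U+FFFF range
    ("Emoji", "\U0001F600"),                 # outside the BMP
]

for label, text in samples:
    utf8_len, utf16_len = encoded_sizes(text)
    print(f"{label:<10} utf8={utf8_len:>2} bytes  utf16={utf16_len:>2} bytes")
```

With these samples, only the ASCII string is smaller in UTF-8 (4 bytes
against 8). The Greek string is the same size in both (8 bytes), the
Japanese string is smaller in UTF-16 (6 bytes against 9), and the emoji
takes 4 bytes either way.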