On Wed, 7 Feb 2007 10:44:55 +0100, demerphq <demerphq@gmail.com> wrote:

>I for one would argue that if we were going to go to a single internal
>encoding that utf8 would be the wrong one. UTF-16 would be much
>better. It would allow us to take advantage of the large amount of
>UTF-16 code out there, ranging from DFA regexp engines to other
>algorithms and libraries. On Win32 the OS natively does UTF-16 so much

Some people argue that only Windows XP and later really does UTF-16, and
that Windows 2000 is just UCS-2 because it doesn't know about surrogate
pairs. But that is mostly a shell/display issue; the file system layer
doesn't care about surrogates anyway.

>of the work would be done by the OS. I'd bet that this was also a
>reason why other languages chose to use UTF-16. In fact I wouldn't be
>surprised if we were the primary language using utf8 internally at
>all.

This is a couple of years old and no longer up to date, but yes, it looks
like Perl doesn't have much company:

http://unicode.org/notes/tn12/tn12-1.html

>I mean heck, utf8 was a kludge worked out on a napkin to make it
>possible to store Unicode filenames in a Unix-style filesystem. (utf8
>has the property that no encoding of a high codepoint contains any
>special character used by a Unix filesystem.) WTF would we use a kludge
>as our primary internal representation when there are better
>representations to use? Especially when you consider the performance
>impact of doing so (use Unicode and watch the regex engine get much
>sloooooweeeeeerrrrrrr.)

This is probably the main reason some big enterprise users stick with
Perl 5.6.1. I've seen several companies approach ActiveState, desperate
for help moving to 5.8 while maintaining their application performance.
Unfortunately there is not much you can do for them beyond the "avoid
Unicode strings, and downgrade every time a module returns something in
Unicode" advice.

Cheers,
-Jan
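
P.S. In case it helps, here is a rough sketch of that "downgrade" advice.
The helper name is just illustrative; the point is that utf8::downgrade()
flips a string's internal representation back to bytes whenever everything
fits in Latin-1, which keeps the regex engine on the fast byte-oriented
path. A true second argument makes it fail quietly instead of croaking
when the string really does contain characters above 0xFF.

    use strict;
    use warnings;
    use Encode ();

    # Illustrative helper: go back to the byte representation when possible.
    sub downgrade_if_possible {
        my ($s) = @_;
        utf8::downgrade($s, 1);   # silently a no-op if chars > 0xFF remain
        return $s;
    }

    # e.g. after a module hands back a UTF8-flagged string:
    my $value = Encode::decode('UTF-8', "caf\xc3\xa9");
    $value = downgrade_if_possible($value);
    print utf8::is_utf8($value) ? "still flagged\n" : "downgraded\n";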