On Wed, Feb 07, 2007 at 01:09:23PM +0100, Mark Overmeer wrote:
> * Nicholas Clark (nick@ccl4.org) [070207 11:52]:
> > Jarkko's view, based on the battle scars from the dragons in the regexp
> > engine, was that fixed width 32 bit was better than anything 16 bit variable
> > width. Doing the latter properly *still* requires dealing with surrogates.
> >
> > The *best* solution might well be fixed 7/8/16/32, using the smallest that
> > fits.
> >
> > But I don't see this coming soon.
>
> And for 7/8bit you would like to keep track of the character-set used
> in the string, such that you can automatically convert to unicode when
> need.

It's simpler to always convert to Unicode on the way in, and to $whatever on
the way out. After all, (as I understand it) one of the features of Unicode is
that it is a superset of all existing encodings. Hence why some of its choices
for what gets distinct code points can seem rather cranky.

I think that Parrot was doing this - convert to Unicode, then store in the
shortest fixed width representation that holds all the code points used in
that string. It's easy to concatenate strings, without needing to pivot between
encodings each time

> And filenames defined inside your program to the charset used on
> a particular file-system. And... implicit conversions where we require

IIRC Jarkko has again looked at that recently, and most operating systems have
no sane API to find out what is being used on a particular mounted filing
system.

Nicholas Clark
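
P.S. To make the "shortest fixed width representation" idea concrete, here is
a rough sketch in Python. It is not how Parrot or perl actually store
strings; the names (narrowest_width, pack, widen, concat, unpack) are made up
for illustration, and I've assumed little-endian storage and lumped the 7 and
8 bit cases together as one byte per code point.

```python
def narrowest_width(s):
    """Bytes per code point (1, 2 or 4) needed to hold every code point in s."""
    top = max(map(ord, s), default=0)
    if top <= 0xFF:
        return 1          # 7 bit and 8 bit both fit in one byte
    if top <= 0xFFFF:
        return 2          # fixed 16 bit, so no surrogate pairs are needed
    return 4              # anything above U+FFFF needs the full 32 bits


def pack(s):
    """Return (width, data) with every code point stored in `width` bytes."""
    width = narrowest_width(s)
    return width, b"".join(ord(c).to_bytes(width, "little") for c in s)


def widen(data, old, new):
    """Re-store fixed-width data using `new` bytes per code point (new >= old)."""
    if old == new:
        return data
    pad = b"\x00" * (new - old)   # little-endian: high-order zeros go last
    return b"".join(data[i:i + old] + pad for i in range(0, len(data), old))


def concat(a, b):
    """Concatenate two packed strings by widening the narrower one."""
    (wa, da), (wb, db) = a, b
    width = max(wa, wb)
    return width, widen(da, wa, width) + widen(db, wb, width)


def unpack(width, data):
    """Turn packed data back into a Python string."""
    return "".join(
        chr(int.from_bytes(data[i:i + width], "little"))
        for i in range(0, len(data), width)
    )
```

Concatenation only ever has to widen the narrower operand; there is no
conversion between different character sets, because both sides already hold
Unicode code points.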