* Nicholas Clark (nick@ccl4.org) [070207 11:52]:
> Jarkko's view, based on the battle scars from the dragons in the regexp
> engine, was that fixed width 32 bit was better than anything 16 bit variable
> width. Doing the latter properly *still* requires dealing with surrogates.
>
> The *best* solution might well be fixed 7/8/16/32, using the smallest that
> fits.
>
> But I don't see this coming soon.

And for 7/8 bit you would like to keep track of the character set used in
the string, so that you can automatically convert to Unicode when needed.
And convert filenames defined inside your program to the charset used on a
particular file system. And... implicit conversions where we require
explicit conversions now.

Hum... so each string needs an associated charset label (which also
determines the number of bytes per character), and each string operation
needs to be aware that operands may require conversion before use...
Maybe: if the two encodings are different, then always convert both to U32.

Sounds like a lot of work, but rather straightforward for most of the way.
-- 
               MarkOv

------------------------------------------------------------------------
       Mark Overmeer MSc                                MARKOV Solutions
       Mark@Overmeer.net                          solutions@overmeer.net
http://Mark.Overmeer.net                   http://solutions.overmeer.net
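
To make the labelled-string idea above a bit more concrete, here is a minimal sketch in Python (not Perl internals). It assumes a small fixed set of fixed-width encodings; the names `LabeledString`, `WIDTH`, `to_u32` and `concat` are hypothetical and only serve the illustration. Each string carries its charset label, the label determines the bytes per character, and an operation on two strings with different labels first converts both to U32.

```python
from dataclasses import dataclass

# U32: fixed width, 4 bytes per character, little-endian, no BOM.
U32 = "utf-32-le"

# Hypothetical table: bytes per character for each fixed-width label.
WIDTH = {"ascii": 1, "latin-1": 1, "koi8-r": 1, U32: 4}


@dataclass(frozen=True)
class LabeledString:
    data: bytes    # raw storage
    charset: str   # label; also determines the bytes per character

    def __len__(self):
        # Length in characters, derived from the label's fixed width.
        return len(self.data) // WIDTH[self.charset]

    def to_u32(self):
        # Already U32: nothing to convert.
        if self.charset == U32:
            return self
        text = self.data.decode(self.charset)
        return LabeledString(text.encode(U32), U32)


def concat(a, b):
    # Same label: the bytes can be joined directly, no conversion needed.
    if a.charset == b.charset:
        return LabeledString(a.data + b.data, a.charset)
    # Different labels: always convert both operands to U32 first.
    a32, b32 = a.to_u32(), b.to_u32()
    return LabeledString(a32.data + b32.data, U32)
```

Joining a Latin-1 string with a KOI8-R string in this sketch yields a U32 result, while joining two Latin-1 strings stays at one byte per character. A real implementation would also have to pick the smallest width that fits after conversion, which this sketch does not attempt.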