* Marvin Humphrey (marvin@rectangular.com) [070207 22:25]:
> On Feb 7, 2007, at 10:37 AM, Mark Overmeer wrote:
>
> Space occupied by the charset labels isn't my concern.  The scenario
> I'm worried about is where somebody has calibrated the memory
> consumption of a string-manipulating application to fit within
> available RAM, or is reasonably close to the threshold by happenstance.
>
> Say someone reads in a string that occupies 300MB when encoded as
> UTF-8.  Say it's mostly ASCII, but has a few code points above the
> BMP thrown in -- musical symbols like the sixteenth note (U+1D161),
> or what have you.  Ka-boom, now that string occupies more than a gig.

Well, whether to normalize into 32-bit or UTF-8 is still to be decided;
32-bit was given more as an example.  You may even decide to store each
string in whatever form is most efficient: if you see that it grows
beyond 100k, you use the slower but smaller UTF-8, otherwise full
32-bit.  If the string exceeds 2M, you use Huffman- or gzip-compressed
32-bit...  Whole new areas of optimization become possible once you add
an "encoding/charset" field to each string (a rough sketch follows at
the end of this message).  But my main target is to hide explicit
recodings.

> Defaulting to 32-bit storage forces the programmer to deal with worst-
> case scenarios right away.

No, it will make all programs slow right away: requesting system
resources is expensive.  And how would you protect that same programmer
from allocating a 400MB string where he tuned for a maximum of 300MB?

> What I was getting at, though, was that a sudden, dramatic increase
> in worst-case-scenario RAM requirements shouldn't be considered
> backwards compatible.

5.12 does not need to be backwards compatible to this extent.
-- 
       MarkOv
------------------------------------------------------------------------
       Mark Overmeer MSc                                MARKOV Solutions
       Mark@Overmeer.net                          solutions@overmeer.net
http://Mark.Overmeer.net                   http://solutions.overmeer.net
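
P.S. The promised sketch of the size-dependent storage rule, in Python
rather than the C that perl's string internals would actually use.  The
100k/2M thresholds come from the text above, but measuring them in code
points, the function name, and using zlib to stand in for "huffman or
gzip" are my own assumptions, not a concrete proposal for 5.12.

    import zlib

    SMALL = 100 * 1024        # below this: plain 32-bit, O(1) indexing
    HUGE  = 2 * 1024 * 1024   # above this: compressed 32-bit, smallest footprint

    def choose_representation(text):
        """Return (label, stored_bytes) for one decoded string."""
        n = len(text)                      # counted in code points (an assumption)
        if n < SMALL:
            return "utf-32", text.encode("utf-32-le")
        if n <= HUGE:
            return "utf-8", text.encode("utf-8")
        # "huffman or gzip compressed 32bit": zlib stands in for gzip here
        return "utf-32+zlib", zlib.compress(text.encode("utf-32-le"))

    if __name__ == "__main__":
        # Marvin's scenario, scaled down: mostly ASCII plus one astral code point
        s = "x" * 200000 + "\U0001D161"
        label, stored = choose_representation(s)
        print(label, len(stored), "bytes for", len(s), "code points")
        # ~200k ASCII characters take ~200 KB as UTF-8 but ~800 KB as fixed
        # 32-bit: the same 4x factor that pushes Marvin's 300MB string past a gig.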