> * Convert from and to UTF-32
> * lengths in bytes, characters, and possibly glyphs
> * character size (with the variable length ones reporting in negative numbers)

What do you mean by character size if it does not support variable length?

> * get and set the locale (This might not be the spot for this)

The locale should be context based. Each thread should have its own locale.

> * normalize (a noop for non-Unicode data)
> * Get the encoding name

The encoding name is tricky. Neither Java nor POSIX defines a naming scheme.
I personally prefer the full name in lower case, such as "iso8859-1", with
the API converting names to lower case automatically. The encoding name must
be strict ASCII. Some common aliases may be provided. There must be an API
to list all supported encodings at runtime.

> * Do a substr operation by character and glyph

The byte-based one is more useful. I have a UTF-8 string, and I want to
substr it into another UTF-8 string. It is painful to convert it or to do a
linear search for the character position.

> I don't know if we want to treat encoding and data format separately--it
> would seem to make sense to be able to have a string tell us it's
> Unicode/UTF-32/Korean rather than just UTF-32/Korean, since I
> don't see why it wouldn't be allowable to use the UTF-8 or UTF-16 encoding
> on non-Unicode data. (Not that it'd necessarily be all that useful, and I
> can see just not allowing it)

I don't think the core should support language/locale in this much detail.
I deal with a lot of mixed Chinese/English text files. There is no way to
represent them using a plain string, unless you want to make the string a
rich-format text buffer. The current locale or an explicit locale parameter
will suffice for your goal.

Hong
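
A minimal sketch of the byte-based substr point, written in Python purely for illustration. The helper names `is_char_boundary` and `byte_substr` are hypothetical, not part of any existing API, and the input is assumed to be valid UTF-8. The idea is that a byte offset can be checked in constant time, because UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so no conversion or linear scan is needed:

```python
def is_char_boundary(data: bytes, pos: int) -> bool:
    """Return True if pos falls between two UTF-8 sequences (or at either end).

    Hypothetical helper; assumes data is valid UTF-8.
    """
    if pos == 0 or pos == len(data):
        return True
    if pos < 0 or pos > len(data):
        return False
    # Continuation bytes look like 10xxxxxx; anything else starts a sequence.
    return (data[pos] & 0xC0) != 0x80


def byte_substr(data: bytes, start: int, length: int) -> bytes:
    """Slice UTF-8 data by byte offset, refusing to split a character.

    Hypothetical helper; the result is itself valid UTF-8 when data is.
    """
    end = start + length
    if not (is_char_boundary(data, start) and is_char_boundary(data, end)):
        raise ValueError("byte offsets split a UTF-8 sequence")
    return data[start:end]


# Mixed Chinese/English text: the two Chinese characters take 3 bytes each.
text = "中文 mixed with English".encode("utf-8")
chinese_part = byte_substr(text, 0, 6)
english_word = byte_substr(text, 7, 5)
```

Whether an API should raise an error on a split character, as this sketch does, or round to the nearest boundary is a design choice left open here.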