develooper Front page | perl.perl6.internals | Postings from August 2002

Encodings - Questions

Angel Faus
August 28, 2002 11:29

Now that we've got ICU in, I thought it might be time to revisit the 
encodings implementation. I am thoroughly ignorant of 
Unicode/encoding issues, so please be patient with me. :)

From what I have asked people on IRC, and what's in the list archives, 
my understanding of how Parrot will work with various encodings is:

i) After an IO operation, strings are kept in their original 
encoding, whatever it is.

ii) Parrot will try to keep the string in that encoding for as long 
as possible.

iii) When some operation that requires ICU is requested, Parrot will 
feed the string to ICU, which will convert it to UTF-16 (its internal 
encoding) and then perform the desired operation.
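
To make points i) to iii) concrete, here is a minimal Python sketch of the model as I understand it. It is not Parrot or ICU code: the class EncodedString and its methods are made-up names, and Python's own codecs stand in for the ICU conversion.

```python
class EncodedString:
    """Hypothetical string holder: raw bytes plus the name of their encoding."""

    def __init__(self, raw: bytes, encoding: str):
        self.raw = raw            # (i) bytes exactly as they came from IO
        self.encoding = encoding  # (ii) kept until something forces a conversion

    def concat(self, other: "EncodedString") -> "EncodedString":
        # Same-encoding concatenation needs no conversion at all.
        if self.encoding != other.encoding:
            raise ValueError("mixed encodings would need a conversion first")
        return EncodedString(self.raw + other.raw, self.encoding)

    def to_utf16(self) -> bytes:
        # (iii) stand-in for handing the string to ICU: decode, then
        # re-encode as UTF-16 (little-endian, no BOM).
        return self.raw.decode(self.encoding).encode("utf-16-le")


s = EncodedString("caf\u00e9".encode("iso-8859-1"), "iso-8859-1")
utf16 = s.to_utf16()  # the original bytes in s.raw are left untouched
```

Note that nothing here decides what happens to the UTF-16 result afterwards; that is exactly question I below.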

Please correct me if this is wrong. Now, my questions are:

I. About iii): I can imagine at least three different options about 
what to do with the converted UTF-16 string:

a) We can discard the UTF-16 version, and recompute the conversion 
each time. (this is costly, isn't it?)

b) We can replace the original string with the "upgraded" version, so 
strings will lazily become converted to UTF-16. (this makes sure that 
the conversion is only done once, but is conversion to UTF-16 always 

c) We can store the UTF-16 version alongside the original one. (This 
doubles the memory usage, plus it may be hard to implement.)

Each approach has its pros and cons. Which one is the right one?
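
The three options can be sketched side by side. This is only an illustration under assumptions: the class names are invented, convert() is a Python stand-in for the ICU conversion, and the counters exist just to make the trade-off visible.

```python
def convert(raw: bytes, encoding: str) -> bytes:
    """Stand-in for the ICU conversion to UTF-16 (little-endian, no BOM)."""
    return raw.decode(encoding).encode("utf-16-le")


class Recompute:
    """Option a): keep only the original; convert on every request."""

    def __init__(self, raw: bytes, encoding: str):
        self.raw, self.encoding = raw, encoding
        self.conversions = 0  # how many times convert() was called

    def utf16(self) -> bytes:
        self.conversions += 1
        return convert(self.raw, self.encoding)

    def stored_bytes(self) -> int:
        return len(self.raw)


class Upgrade(Recompute):
    """Option b): replace the original with the UTF-16 version on first use."""

    def utf16(self) -> bytes:
        if self.encoding != "utf-16-le":
            self.conversions += 1
            self.raw = convert(self.raw, self.encoding)
            self.encoding = "utf-16-le"
        return self.raw


class CacheAlongside(Recompute):
    """Option c): keep the original and cache the UTF-16 copy next to it."""

    def __init__(self, raw: bytes, encoding: str):
        super().__init__(raw, encoding)
        self.cached = None

    def utf16(self) -> bytes:
        if self.cached is None:
            self.conversions += 1
            self.cached = convert(self.raw, self.encoding)
        return self.cached

    def stored_bytes(self) -> int:
        return len(self.raw) + len(self.cached or b"")
```

With a single-byte encoding such as ISO-8859-1, the UTF-16 copy is twice the size of the original, so option c) ends up holding three times the original bytes rather than two. A real option c) would also have to invalidate the cached copy whenever the string is modified, which this sketch does not attempt.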

II. About ii): At exactly what point do we decide to feed 
the string to ICU, and what operations should we (as Parrot 
developers) implement in our own layer?

For example, let's take a relatively simple operation, such as 
uppercasing a string, and let's assume that the string is in a 
friendly encoding, such as ISO-8859-1. 

Even with these assumptions, conversion to uppercase is not 
straightforward, since it's locale-dependent (or to be more precise, 
it might be locale-dependent if the user so chooses).
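
A small Python sketch shows why even ISO-8859-1 is less friendly than it looks. The first helper uses Python's built-in Unicode case mapping, which is not locale-aware. The second, upper_for_locale(), is a hypothetical and deliberately simplified tailoring for Turkish, not an ICU or Parrot API.

```python
def upper_fits_latin1(ch: str):
    """Uppercase ch and report whether the result still fits in ISO-8859-1."""
    upper = ch.upper()
    try:
        upper.encode("iso-8859-1")
        return upper, True
    except UnicodeEncodeError:
        return upper, False


# Three ISO-8859-1 characters with awkward uppercase forms:
sharp_s = upper_fits_latin1("\u00df")   # sharp s grows to two letters, "SS"
y_diaer = upper_fits_latin1("\u00ff")   # maps to U+0178, outside ISO-8859-1
micro = upper_fits_latin1("\u00b5")     # maps to U+039C, outside ISO-8859-1


def upper_for_locale(text: str, locale: str) -> str:
    """Hypothetical locale tailoring: Turkish uppercases "i" to U+0130."""
    if locale.startswith("tr"):
        text = text.replace("i", "\u0130")
    return text.upper()
```

So uppercasing in place can change the length of the string, can produce characters the original encoding cannot represent, and can give different answers for the same input depending on the locale.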

We could, of course, implement all locale-aware operations for each 
encoding and each locale, but how much work do we want to put on 

So, exactly what string functionality do we want to implement 
ourselves on a per-encoding basis, and what are we going to 
forward to ICU?

-angel