Encodings - Questions
From: Angel Faus
Date: August 28, 2002 11:29
Subject: Encodings - Questions
Message ID: 200208282026.11239.afaus@corp.vlex.com
Hi,
Now that we've got ICU in, I thought it might be time to revisit the
encodings implementation. I am thoroughly ignorant about
Unicode/encoding issues, so please be patient with me. :)
From what I have asked people on IRC, and what's in the list archives,
my understanding of how Parrot will work with various encodings is:
i) After an IO operation, strings are kept in their original
encoding, whatever it is.
ii) Parrot will try to keep the string in that encoding for as long
as possible.
iii) When some operation that requires ICU is requested, Parrot will
feed the string to ICU, which will convert it to UTF-16 (its internal
encoding) and then perform the desired operation.
Please correct me if this is wrong. Now, my questions are:
I. About iii): I can imagine at least three different options for
what to do with the converted UTF-16 string:
a) We can discard the UTF-16 version, and recompute the conversion
each time. (this is costly, isn't it?)
b) We can replace the original string with the "upgraded" version, so
strings will lazily become converted to UTF-16. (this makes sure that
the conversion is only done once, but is conversion to UTF-16 always
lossless?)
c) We can store the UTF-16 version alongside the original one. (this
doubles the memory usage, plus it may be hard to implement)
Each approach has its pros and cons. Which one is the right one?
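To make the three options concrete, here is a minimal sketch. It is written in Python rather than Parrot's C, Python's built-in codecs stand in for ICU's converters, and the names LazyString, as_utf16 and conversions_after_two_calls are hypothetical, not anything in the Parrot source. It only counts how often each strategy pays for a conversion.

```python
class LazyString:
    """Hypothetical wrapper; not Parrot's actual string structure."""

    def __init__(self, raw, encoding, strategy):
        self.raw = raw            # bytes exactly as they came from IO
        self.encoding = encoding  # e.g. "iso-8859-1"
        self.strategy = strategy  # "discard" (a), "upgrade" (b) or "cache" (c)
        self.utf16 = None         # extra UTF-16 copy, only kept by "cache"
        self.conversions = 0      # how many times we paid for a conversion

    def as_utf16(self):
        """Return the UTF-16 form an ICU-style operation would need."""
        if self.encoding == "utf-16-le":
            return self.raw                    # already upgraded by (b)
        if self.utf16 is not None:
            return self.utf16                  # reuse the copy kept by (c)
        converted = self.raw.decode(self.encoding).encode("utf-16-le")
        self.conversions += 1
        if self.strategy == "upgrade":
            # (b): replace the original bytes with the UTF-16 version
            self.raw, self.encoding = converted, "utf-16-le"
        elif self.strategy == "cache":
            # (c): keep both versions side by side
            self.utf16 = converted
        # (a): "discard" keeps nothing, so the next call converts again
        return converted


def conversions_after_two_calls(strategy):
    """Ask for the UTF-16 form twice and report the cost and final encoding."""
    s = LazyString("caf\u00e9".encode("iso-8859-1"), "iso-8859-1", strategy)
    s.as_utf16()
    s.as_utf16()
    return s.conversions, s.encoding
```

In this sketch two requests cost two conversions under (a) and one under (b) or (c), while (c) holds both copies in memory. For ISO-8859-1 specifically the trip to UTF-16 and back is lossless, because every byte maps to its own code point; the sketch says nothing about other encodings.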
II. About ii): At exactly which point do we decide to feed
the string to ICU, and which operations should we (as Parrot
developers) implement in our own layer?
For example, let's take a relatively simple operation, such as
uppercasing a string, and let's assume that the string is in a
friendly encoding, such as ISO-8859-1.
Even with these assumptions, conversion to uppercase is not
straightforward, since it's locale-dependent (or, to be more precise,
it might be locale-dependent if the user so chooses).
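A small illustration of why even ISO-8859-1 is less friendly than it looks. This is only a sketch in Python, using its locale-independent str.upper() as a stand-in for a real case-mapping layer; the helper names upper_latin1 and can_upper_in_same_encoding are hypothetical.

```python
def upper_latin1(raw):
    """Uppercase ISO-8859-1 bytes by going through Unicode.

    Raises UnicodeEncodeError when the uppercase form has no
    ISO-8859-1 representation.
    """
    return raw.decode("iso-8859-1").upper().encode("iso-8859-1")


def can_upper_in_same_encoding(raw):
    """True if the uppercased text still fits in ISO-8859-1."""
    try:
        upper_latin1(raw)
        return True
    except UnicodeEncodeError:
        return False


plain = b"angel"                              # ASCII only: no surprises
sharp_s = "stra\u00dfe".encode("iso-8859-1")  # U+00DF uppercases to "SS"
y_diaeresis = b"\xff"                         # U+00FF uppercases to U+0178
micro = b"\xb5"                               # U+00B5 uppercases to U+039C

print(upper_latin1(plain))
print(len(sharp_s), len(upper_latin1(sharp_s)))   # the length changes
print(can_upper_in_same_encoding(y_diaeresis))
print(can_upper_in_same_encoding(micro))
```

So the result can grow (U+00DF becomes two letters) or leave the encoding entirely (the uppercase forms of U+00FF and U+00B5 are outside ISO-8859-1), and that is before locale enters the picture: a Turkish locale would expect "i" to uppercase to U+0130, which this locale-independent sketch does not do.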
We could, of course, implement all locale-aware operations for each
encoding and each locale, but how much work do we want to put into
this?
So, exactly which string functionality do we want to implement
ourselves on a per-encoding basis, and which parts are we going to
forward to ICU?
-angel