The State of The Unicode
From: Jarkko Hietaniemi
February 19, 2001 14:48
Message ID: 20010219164753.B15351@chaos.wustl.edu
I appreciate your detailed analysis; it's certainly more detailed
than what we have seen over the last few days.
From my viewpoint, however, the situation is as follows:
(1) The current model, both externally and internally,
follows what is described by the Camel Mk3. As the pumpkin
I'm somewhat obligated to abide by that, at least that's the
first-degree approximation. (Incidentally, I think the reason
the Camel is so vague is that when it was written the Unicode
model was being ripped to shreds to be rebuilt, in a discussion
not unlike the one we are having.)
(2) The basic Unicode support seems to be in rather good shape now.
What I mean by "basic" is that as long as you don't start pulling your hair
out over this very bytes vs UTF-8 vs characters issue, and just concatenate
strings, compare them, take their length, do regexes on them, etc., pretty
much everything seems to be working.
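
A minimal sketch of the "basic" operations listed in (2), assuming a
later Perl (5.8 or newer); the variable names and sample strings are
hypothetical and chosen only for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample data: strings containing characters above 0xFF.
my $left  = chr(0x100) . "bc";     # U+0100 followed by "bc"
my $right = "x" . chr(0x263A);     # "x" followed by U+263A

my $joined = $left . $right;                        # concatenation
my $len    = length($joined);                       # counted in characters
my $same   = ($left eq chr(0x100) . "bc") ? 1 : 0;  # comparison
my $match  = ($joined =~ /^\x{100}bc.\x{263A}$/) ? 1 : 0;  # regex

# Only ASCII is printed, so no output layer is needed here.
print "length=$len same=$same match=$match\n";  # length=5 same=1 match=1
```

None of these operations needs to know how the characters are stored
internally, which is the sense of "basic" used above.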
Combine (1) and (2) and I see it as a "what is broken, so what's there to
fix" situation; let's call it (3).
As for "what is broken", I do understand the concern about "exposing too
much of the internal representation" (which at the moment happens to
be UTF-8) to the user; having both bytes and characters is confusing at
best. However, I'm not fully convinced that completely hiding it is
wise, either. If from Perl level one cannot reach back to the bytes
comprising the UTF-8 representation of the characters, I feel we are
trying to pad the cell too softly. One has to be able to do bytes if
one wants to. I can live without bytes::length, I can live without
qu, but taking completely away the ability to get at the bytes of
chr(0x100) is folly. (A logical extension would be to take away
unpack("b" and "B") and vec(), whoever needs those nasty bits anyway.)
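
To make "getting at the bytes of chr(0x100)" concrete, here is a hedged
sketch of one way to do it. It assumes a later Perl (5.8 or newer) where
utf8::encode() is available, and an ASCII-based platform; it is not
necessarily the mechanism that was under discussion at the time:

```perl
use strict;
use warnings;

my $char = chr(0x100);     # a single character, U+0100

# Work on a copy: utf8::encode() replaces the string in place with the
# bytes of its UTF-8 encoding. (Assumption: ASCII platform; on EBCDIC
# the resulting bytes would differ.)
my $bytes = $char;
utf8::encode($bytes);

my @octets = unpack("C*", $bytes);   # numeric value of each byte

printf "characters: %d\n", length($char);    # characters: 1
printf "bytes: %d\n",      length($bytes);   # bytes: 2
printf "octets: %s\n",
    join(" ", map { sprintf "%02X", $_ } @octets);   # octets: C4 80
```

The character-level view (length 1) and the byte-level view (two bytes)
stay separate; the byte view is reached only by an explicit request.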
Now, lately it has been amply pointed out that there's something
wrong. How much is wrong ranges, depending on who is speaking, from
everything is wrong including the internal representation of scalars
containing bytes, to the view that some details about exposing the internals
too much need to be cleaned up, and we should not forget the EBCDICers.
What this recent furor seems to be over is mainly two things: the new
qu operator, and that the methods for "creating characters" were made
consistent over the contentious 0x80..0xff range.
If it is decided to hide everything (note my above reservation over
*everything*) about the UTF-8-ness of the characters, fine. But
because of (3), I am personally reluctant to start unless someone
presents me a list of things to hide. It need not be an essay or
a 500-page tech report, just a comprehensive list of things to do.
I can do the coding, I've already more or less trampled over all
the UTF-8 related code there is in the Perl code at least once,
but I need a list.
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen