Front page | perl.perl6.porters |
Postings from February 2000
Re: Locales: An Analysis
From: Tim Bray
February 4, 2000 09:27
Message ID: firstname.lastname@example.org
[some horribly-unstructured random data points]
At 12:25 AM 2/4/00 -0800, Chip Salzenberg wrote:
>> : use charset 'iso10646'; == force ISO 10646 (Unicode superset)
>> Not really a superset anymore, unless you're into defining your own
>> characters outside of U+10FFFF.
>I don't understand... Could someone point me to a description of the
>current Unicode <-> ISO 10646 relationship?
It appears in one of the appendices of the [excellent, go buy it from
unicode.org] Unicode spec. Essentially, they are the same spec, but this
is achieved by an elaborate parallel structure of committees and
working groups who always magically and independently do the same thing;
of course many of the people serve in both processes.
There is one conceptual difference; 10646 says in theory you can have
2^31 characters. Unicode only recognizes 2^16 + 2^20 (BMP + 16 expansion
planes). I wonder if in the year 2345, they'll be cursing the short-sighted
21st-century Unicode morons whose 17 planes didn't leave room for the
dialects of the Lesser Magellanic cloud worlds. Well, do like Larry says
and use 4 bytes and that should get us through most of the millennium.
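
The arithmetic behind those numbers can be checked with a short sketch. This is Python rather than perl, and the names (PLANE_COUNT, plane_of, and so on) are made up for illustration; nothing here comes from either spec's text.

```python
# Sizes of the two code spaces described above.
BMP_SIZE = 2 ** 16                        # one plane holds 65,536 code points
PLANE_COUNT = 17                          # the BMP plus 16 expansion planes
UNICODE_CODE_POINTS = 2 ** 16 + 2 ** 20   # the figure quoted in the text
ISO_10646_THEORETICAL = 2 ** 31           # the 10646 "in theory" figure

# 2^16 + 2^20 is exactly 17 planes, i.e. U+0000 through U+10FFFF.
assert UNICODE_CODE_POINTS == PLANE_COUNT * BMP_SIZE == 0x110000


def plane_of(code_point):
    """Return the plane number (0 = BMP) of a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the range U+0000..U+10FFFF")
    return code_point >> 16


print(UNICODE_CODE_POINTS)                                      # 1114112
print(plane_of(0x0041), plane_of(0x1F600), plane_of(0x10FFFF))  # 0 1 16
print((0x10FFFF).bit_length())                                  # 21
```

The largest code point needs 21 bits, so a 4-byte fixed-width representation has room to spare.
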
I feel that one of the nice things about using perl is that you shouldn't
have to worry about things like UTF-16's [rather reasonable I think]
extension mechanism, or about the hideous bit-packing bogosities of UTF-8,
which are only defensible in a world whose basic technical infrastructure
depends heavily on strlen() and strcpy(), but that's the world we happen to
live in. Note that UTF-8 is kinda bigoted in that us pink-skinned roundeyes
get to store most of our characters in one byte per, leaving it to the other
75% of the world's population to pay the price, in extra bytes, for
BTW, should ord($c) return different values depending on whether or not
I've said "use utf8;"?
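
To make the ord() question concrete, here is the same distinction in Python, which keeps decoded text and raw bytes as separate types. It is only an analogy and says nothing about what perl does or should do; the variable names are illustrative.

```python
text = "\u00e9"                 # one character, U+00E9 (e with acute accent)
raw = text.encode("utf-8")      # the same character as UTF-8: two bytes

code_point = ord(text)          # the character's code point
first_byte = raw[0]             # the first byte of its UTF-8 encoding

print(code_point)               # 233
print(list(raw))                # [195, 169]
print(first_byte)               # 195
```

Whether ord() answers 233 or 195 for the same data is exactly the kind of difference the question is about.
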
It should be noted that over in Java-land, UTF-16 is more or less the
native dialect, and UTF-8 is a royal pain in the butt to deal with. Sigh.
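
For the record, here is a small sketch of the two mechanisms mentioned above: the per-character byte cost of UTF-8 and the UTF-16 extension (surrogate pair) mechanism. It is written in Python, and the helper names utf8_length and utf16_units are hypothetical, invented for this illustration.

```python
def utf8_length(ch):
    """Number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def utf16_units(ch):
    """Return the UTF-16 code units (one or two) for a single character."""
    cp = ord(ch)
    if cp < 0x10000:
        return [cp]                       # BMP character: one 16-bit unit
    cp -= 0x10000                         # 20 bits remain
    high = 0xD800 | (cp >> 10)            # top 10 bits -> high surrogate
    low = 0xDC00 | (cp & 0x3FF)           # bottom 10 bits -> low surrogate
    return [high, low]


samples = [
    ("Latin A", "\u0041"),
    ("Cyrillic Zhe", "\u0416"),
    ("CJK 'sun'", "\u65e5"),
    ("U+1F600", "\U0001F600"),
]
for name, ch in samples:
    units = [hex(u) for u in utf16_units(ch)]
    print(name, utf8_length(ch), units)
```

The four samples take 1, 2, 3, and 4 bytes in UTF-8 respectively, while only the last one needs two units in UTF-16.
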
Also, there are lots of non-Asian non-Unicode non-8859 character sets;
probably the best known is the KOI8 family (KOI8-R and friends) for Cyrillic.
Over in XML-land, including the increasingly popular XML::Parser
module, the data can come in in a variety of flavors, but the
programmer only sees Unicode, and if you write it out you find you've
written UTF-8. Having all programmers see only Unicode all the time is
a big enough win, even though when programmers first see it you
tend to get some whining about Stupid MS Code Page Tricks not working.
On the other hand, the data magically transmogrifying itself from
JIS or EBCDIC or whatever to UTF-8 as a result of a trip through perl
is kinda off-putting... pardon the digression.
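
The "comes in as anything, goes out as UTF-8" behaviour described above can be sketched like this. It uses Python's built-in codecs rather than XML::Parser, the function name to_utf8 is hypothetical, and cp037 is just one of several EBCDIC code pages, picked as an assumption for the example.

```python
def to_utf8(raw, source_encoding):
    """Decode bytes from a legacy encoding, then re-encode as UTF-8."""
    text = raw.decode(source_encoding)    # the program now sees only Unicode
    return text.encode("utf-8")           # and writes only UTF-8


# Build some legacy-encoded input for the demonstration.
samples = {
    "koi8_r": "\u041f\u0440\u0438\u0432\u0435\u0442".encode("koi8_r"),
    "shift_jis": "\u65e5\u672c\u8a9e".encode("shift_jis"),
    "cp037": "Hello".encode("cp037"),
}

for encoding, raw in samples.items():
    print(encoding, raw, to_utf8(raw, encoding))
```

The bytes that come out are never the bytes that went in, which is the off-putting part: a round trip back to the original encoding has to be asked for explicitly.
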
Using Unicode takes care of some, but by no means all, of the
collation-sequence problems that locale used to help with. Hmm.
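
A quick illustration of why Unicode alone does not settle collation: sorting by code point puts capitals before lowercase and accented letters after "z". The sketch below is Python, and naive_sort_key is a made-up helper that strips accents and case. It is a crude stand-in, not a real collation algorithm, and it is wrong for languages that sort accented letters separately (Swedish puts "ä" after "z", for instance).

```python
import unicodedata


def naive_sort_key(word):
    """Crude key: drop combining marks and ignore case. Not locale-aware."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


words = ["zebra", "\u00c4pfel", "apple", "Zoo"]

by_code_point = sorted(words)
by_naive_key = sorted(words, key=naive_sort_key)

print(by_code_point)   # Zoo, apple, zebra, then the A-umlaut word last
print(by_naive_key)    # the A-umlaut word, apple, zebra, Zoo
```

Which of the possible orders is "right" depends on the reader's language, and that is the part locales still have to supply.
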
There are some general-purpose international character-munging libraries
out there, the best known being GNU iconv and ICU from IBM alphaworks.
ICU is C++. They each depend on meg after meg of tables so you can
wire in support for various legacy encodings. If you could skip the
legacy-encoding stuff and just do collating and other locale-ish stuff,
you could do it relatively compactly. What do Locales Done The Right
Way need to do? -T.