Re: char16 datatype
November 14, 2008 00:27
Message ID: firstname.lastname@example.org
2008/11/14 karl williamson <email@example.com>:
> Tom Christiansen wrote:
>> There's a bunch going on with standardization, widechars, utf-8, etc,
>> now. If only UTF-8 had been around earlier ("What, 1992 isn't early
>> enough?"), a lot of trouble would have been averted. That Perl settled on
>> UTF-8 internally early on was applauded by the Association's current
>> standards rep as clearly the right way to go.
>> It's really sad that the C std committee looks set to
>> accept Microsoft's char16 datatype for wide characters. This locks you
>> into UCS-2/UTF-16, which means surrogates to get off the primary plane,
>> and a very long/bad recovery if you poke your head in the wrong place in
>> the stream. This is going to make problems for people. Java has the
>> problem. EXIF has the problem.
> I have a friend on the ISO C committee. I sent him the above snippet and
> asked him to comment. This may not have anything really to do with Perl 5,
> but since it got brought up, fyi, here's his response:
> C has always bent over backwards to be character set agnostic, so that any
> reasonable character set would work with it. That is not going to change.
> Most people will still use char, which these days will usually get UTF-8,
> depending on the locale. When that is not enough, most people will still
> use wchar_t, which these days will usually get UTF-32.
From the viewpoint of data processing, UTF-32 is a far superior
representation to UTF-8.
From the viewpoint of data storage, not so. However, given that
data storage keeps getting cheaper, the main remaining utility of UTF-8
is that it is backwards compatible with legacy
software like the *nix filesystems. Ultimately UTF-8 was a kludge,
developed practically overnight to ensure that there would be a
Unicode representation that was Unix legacy compatible, with the long
term intention of replacing it with something better. Win32 switched
to UCS-2 and then to UTF-16, and it wouldn't surprise me if in some
future iteration they switch to UTF-32. The question is how long do
the *nixes stick with the kludge?
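
To make the storage-versus-processing trade-off and the surrogate issue concrete, here is a minimal Python sketch. It is only an illustration: the helper names utf16_surrogates and encoded_sizes are made up for this example and are not from any library, and the sample strings are arbitrary. It uses the standard codecs to count bytes per encoding and computes a surrogate pair by hand for a code point outside the primary plane.

```python
def utf16_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the primary plane")
    v = cp - 0x10000                      # 20-bit offset into the supplementary planes
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def encoded_sizes(text):
    """Return the encoded length in bytes of text for each encoding form.

    The -le codec variants are used so that no byte-order mark is counted.
    """
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


if __name__ == "__main__":
    # ASCII, Latin-1 range, rest of the primary plane, supplementary plane
    for sample in ("perl", "\u00e9", "\u20ac", "\U0001D11E"):
        print(ascii(sample), encoded_sizes(sample))

    high, low = utf16_surrogates(0x1D11E)
    print(hex(high), hex(low))            # 0xd834 0xdd1e
```

For plain ASCII the three encodings take 1, 2 and 4 bytes per character, which is the storage argument for UTF-8. Only the 4-byte form gives every code point the same width, which is the processing argument for UTF-32. UTF-16 sits in between: fixed width inside the primary plane, two 16-bit units (a surrogate pair) outside it.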
perl -Mre=debug -e "/just|another|perl|hacker/"