perl.perl5.porters | Postings from November 2008

Re: char16 datatype

November 14, 2008 00:27
2008/11/14 karl williamson:
> Tom Christiansen wrote:
>> [snip]
>> There's a bunch going on with standardization, widechars, utf-8, etc,
>> right now. If only UTF-8 had been around earlier ("What, 1992 isn't early
>> enough?"), a lot of trouble would have been averted.  That Perl settled on
>> UTF-8 internally early on was applauded by the Association's current
>> standards rep as clearly the right way to go.
>> It's really sad that the C std committee looks to be going to
>> accept Microsoft's char16 datatype for wide characters.  This locks you
>> into UCS-2/UTF-16, which means surrogates to get off the primary plane,
>> and a very long/bad recovery if you poke your head in the wrong place in
>> the stream.  This is going to make problems for people.  Java has the
>> problem.  EXIF has the problem.
> I have a friend on the ISO C committee.  I sent him the above snippet and
> asked him to comment.  This may not have anything really to do with Perl 5,
> but since it got brought up, fyi, here's his response:
> Karl,
>  C has always bent over backwards to be character set agnostic, so that any
> reasonable character set would work with it.  That is not going to change.
>  Most people will still use char, which these days will usually get UTF-8,
> depending on the locale.  When that is not enough, most people will still
> use wchar_t, which these days will usually get UTF-32.

From the viewpoint of data processing, UTF-32 is a far superior
representation to UTF-8.
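
To make the processing argument concrete, here is a minimal sketch of
fetching the n-th code point from each encoding. The function names and
the sample string are hypothetical, chosen only for illustration. With
UTF-32 the lookup is plain offset arithmetic; with UTF-8 it needs a scan
from the start of the buffer.

```python
import struct


def nth_codepoint_utf32(buf, n):
    """UTF-32LE: every code point is 4 bytes, so item n sits at byte 4*n."""
    if n < 0:
        raise IndexError(n)
    (cp,) = struct.unpack_from("<I", buf, 4 * n)
    return chr(cp)


def nth_codepoint_utf8(buf, n):
    """UTF-8: walk the buffer, counting every byte that is not a
    continuation byte (0b10xxxxxx), until the n-th code point starts."""
    if n < 0:
        raise IndexError(n)
    start = None
    seen = -1
    for i, b in enumerate(buf):
        if b & 0xC0 != 0x80:          # ASCII or lead byte: a new code point
            seen += 1
            if seen == n:
                start = i
            elif seen == n + 1:       # next code point begins, so n ends here
                return buf[start:i].decode("utf-8")
    if start is None:
        raise IndexError(n)
    return buf[start:].decode("utf-8")  # n was the last code point


# Hypothetical sample: 12 code points, one of them outside the primary plane.
text = "na\u00efve \U0001D11E caf\u00e9"
utf32 = text.encode("utf-32-le")
utf8 = text.encode("utf-8")
```

Both functions return the same character for the same index; the
difference is only in how much of the buffer has to be touched to find it.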

From the viewpoint of data storage, not so. However, given that
data storage becomes cheaper and cheaper, the utility of UTF-8
essentially comes down to its backwards compatibility with legacy
software like the *nix filesystems. Ultimately UTF-8 was a kludge,
developed practically overnight to ensure that there would be a
Unicode representation that was Unix legacy compatible, with the long
term intention of replacing it with something better. Win32 switched
to UCS-2 and then to UTF-16, and it wouldn't surprise me if in some
future iteration they switch to UTF-32. The question is how long do
the *nixes stick with the kludge?
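
For reference, a small sketch of the storage trade-off and of the
surrogate mechanism mentioned in the quoted text. The helper names are
hypothetical. It compares encoded sizes for the three encodings and
shows how a code point above U+FFFF is split into two 16-bit units in
UTF-16.

```python
def encoded_sizes(s):
    """Byte length of s in each encoding (little-endian, no BOM)."""
    return {enc: len(s.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


def utf16_units(cp):
    """Return the UTF-16 code unit(s) for one code point.

    Code points up to U+FFFF fit in one 16-bit unit. Anything above is
    split into a high and a low surrogate. This sketch does not validate
    its input (for example, it does not reject lone surrogate values).
    """
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000                      # 20 bits left to distribute
    return [0xD800 | (v >> 10),           # high surrogate: top 10 bits
            0xDC00 | (v & 0x3FF)]         # low surrogate: bottom 10 bits


ascii_sizes = encoded_sizes("hello")          # pure ASCII text
astral_sizes = encoded_sizes("\U0001D11E")    # one code point off the primary plane
clef_units = utf16_units(0x1D11E)
```

For pure ASCII text the UTF-32 form is four times the size of the UTF-8
form, while a single code point above U+FFFF takes four bytes in all
three encodings.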


perl -Mre=debug -e "/just|another|perl|hacker/"
