Front page | perl.perl5.porters |
Postings from September 2000
unicode support and perl
From: Marc Lehmann
Date: September 13, 2000 20:42
Subject: unicode support and perl
Message ID: 20000914054225.B14317@cerebro.laendle
I just read through the Encode discussion. Having written a similar module
(much less ambitious), namely Convert::Scalar to solve my immediate needs,
I stumbled through the same problems, which I mainly solved by keeping to
the internal API.
While the discussion about Encode seems right to me, I'd like to diagnose the
following problems:
- Encode tries to mix two entirely orthogonal concepts: switching
between the character set a scalar uses and the bit representation
of characters. While it is nice to be able to encode/decode between
character encodings and even character sets (I would certainly be one of
the first users, having played with implementing an iconv interface for
some time now), this MUST NOT be confused with how perl handles these
bit representations internally.
At the moment, perl supports "plain bytes" and "utf-8" as the only
representations a scalar can be in. Trying to hide this from the user
using some generic API will fail miserably. (For example, the user will
always need to be able to force the representation, e.g. when reading data
from a database, in which case nothing (no i/o discipline or anything else)
will ever be able to set that utf-8 flag automatically).
Therefore the [INTERNAL] remarks in the Encode docs that were posted
recently are simply wrong: This must not be "internal", as there is a
large difference between a unicode string that perl treats as unicode
and a unicode string that perl treats as bytes that happen to be valid
utf-8.
- perl is 100% unusable as soon as it comes to using utf8. I conjecture
that nobody has seriously done anything with perl and utf8, given the
amount of functionality that breaks when you do so. So first priority
should be to get it working in practice, before creating fancy modules
to convert between usable and unusable representations.
- related to the previous point: switching to ucs-2 might make sense in
terms of speed, but switching to ucs-4 would be a horrible waste of space
AND speed. And given that UCS-2 is not really what we want (we want
UTF-16?), we will get the same problems we have with utf-8. That utf-8
is slower than ucs-2 also has to be proven first ;=) DECIDING on the
representation, however, is crucial since people (at least me) start to
make use of this functionality, and switching to some other internal
encoding will break this code.
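
Two of the points above can be illustrated with a minimal sketch. It is written
in Python rather than Perl, purely as a stand-in: Python's bytes/str split is
only an analogy for perl's "plain bytes" versus "utf-8" scalars, and none of
the names below (describe, encoded_sizes) come from Encode, Convert::Scalar or
perl's internal API. The first half shows that the same octets give different
answers depending on whether they are treated as bytes or as characters; the
second half compares storage cost, using Python's "utf-32-le" codec as an
assumed stand-in for ucs-4.

```python
# Illustration only: a Python analogy, not perl's internals.

def describe(value):
    """Return (kind, length) so two views of the same octets can be compared."""
    return (type(value).__name__, len(value))

# "Grüße €" held two ways.
octets = "Gr\u00fc\u00dfe \u20ac".encode("utf-8")  # bytes that happen to be valid UTF-8
text = octets.decode("utf-8")                      # the same data, treated as characters

def encoded_sizes(s):
    """Size in bytes of s under three encodings (no byte-order mark)."""
    return {
        "utf-8": len(s.encode("utf-8")),
        "utf-16": len(s.encode("utf-16-le")),
        "ucs-4": len(s.encode("utf-32-le")),  # assumption: UTF-32 stands in for ucs-4
    }

if __name__ == "__main__":
    print(describe(octets))             # ('bytes', 11)
    print(describe(text))               # ('str', 7)
    print(encoded_sizes("hello"))       # {'utf-8': 5, 'utf-16': 10, 'ucs-4': 20}
    print(encoded_sizes("\U0001D11E"))  # {'utf-8': 4, 'utf-16': 4, 'ucs-4': 4}
```

The last line uses a character outside the 16-bit range: it needs a surrogate
pair in UTF-16, so UTF-16 is variable-width just as utf-8 is, while a strict
16-bit ucs-2 could not represent it at all.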
I hope I do not sound too harsh, but I have just converted one of my
larger perl modules into using utf-8, which requires me to write something
like "oh, and please use perl-5.7 + some custom patches if you want to use
this module" into the README, and even writing patches didn't have the
effect of getting this fixed.
]:-> just my Euro 0.02
--
-----==- |
----==-- _ |
---==---(_)__ __ ____ __ Marc Lehmann +--
--==---/ / _ \/ // /\ \/ / pcg@opengroup.org |e|
-=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE --+
The choice of a GNU generation |
|