unicode support and perl

From:
Marc Lehmann
Date:
September 13, 2000 20:42
Subject:
unicode support and perl
Message ID:
20000914054225.B14317@cerebro.laendle
I just read through the Encode discussion. Having written a similar module
(much less ambitious), namely Convert::Scalar to solve my immediate needs,
I stumbled through the same problems, which I mainly solved by keeping to
the internal API.

While the discussion about Encode seems right to me, I'd like to diagnose the
following problems:

- Encode tries to mix two entirely orthogonal concepts: switching
  between the character set a scalar uses and the bit representation
  of characters.  While it is nice to be able to encode/decode between
  character encodings and even character sets (I would certainly be one of
  the first users, having played with implementing an iconv interface for
  some time now), this MUST NOT be confused with how perl handles these
  bit representations internally.

  At the moment, perl supports "plain bytes" and "utf-8" as the only
  representations a scalar can be in. Trying to hide this from the user
  using some generic API will fail miserably. (For example, the user will
  always need to be able to force the representation, e.g. when reading data
  from a database, in which case nothing (no i/o discipline or anything else)
  will ever be able to set that utf-8 flag automatically.)

  Therefore the [INTERNAL] remarks in the Encode docs that were posted
  recently are simply wrong: This must not be "internal", as there is a
  large difference between a unicode string that perl treats as unicode
  and a unicode string that perl treats as bytes that happen to be valid
  utf-8.

- perl is 100% unusable as soon as it comes to using utf8. I conjecture
  that nobody has seriously done anything with perl and utf8, given the
  amount of functionality that breaks when you do so. So the first priority
  should be to get it working in practice, before creating fancy modules
  to convert between usable and unusable representations.

- related to the previous point: switching to ucs-2 might make sense in
  terms of speed, but switching to ucs-4 would be a horrible waste of space
  AND speed.  And given that UCS-2 is not really what we want (we want
  UTF-16?), we will get the same problems we have with utf-8. That utf-8
  is slower than ucs-2 also has to be proven first ;=) DECIDING on the
  representation, however, is crucial since people (at least me) start to
  make use of this functionality, and switching to some other internal
  encoding will break this code.
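
To make the first point concrete, here is a minimal sketch of the
difference between "bytes that happen to be valid utf-8" and a string
perl actually treats as unicode. It assumes a present-day perl in which
the Encode module's decode() and the utf8::is_utf8() builtin are
available; these are not necessarily the API that was under discussion
here, and $from_db is only a hypothetical stand-in for a value fetched
from a database.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for a database value: the word "K\x{e4}se"
# (a-umlaut as second letter) stored as UTF-8 octets, so the umlaut
# arrives as the two bytes 0xC3 0xA4.
my $from_db = "K\xc3\xa4se";

# Nothing has told perl that these octets are UTF-8, so it sees
# five plain bytes and the internal utf-8 flag is off.
my $bytes_len  = length $from_db;          # 5
my $bytes_flag = utf8::is_utf8($from_db);  # false

# Decoding explicitly forces the character interpretation:
# the same data is now four characters, flagged as utf-8 internally.
my $text       = decode('UTF-8', $from_db);
my $chars_len  = length $text;             # 4
my $chars_flag = utf8::is_utf8($text);     # true

printf "bytes: length=%d flag=%d\n", $bytes_len, $bytes_flag ? 1 : 0;
printf "chars: length=%d flag=%d\n", $chars_len, $chars_flag ? 1 : 0;
```

The octets are identical in both cases; only what perl believes about
them differs, which is why the caller has to be able to state it.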

I hope I do not sound too harsh, but I have just converted one of my
larger perl modules into using utf-8, which requires me to write something
like "oh, and please use perl-5.7 + some custom patches if you want to use
this module" into the README, and even writing patches didn't have the
effect of getting this fixed.

]:-> just my Euro 0.02

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@opengroup.org |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |
