
Re: The State of The Unicode

Nick Ing-Simmons
February 20, 2001 06:54
Jarkko Hietaniemi <> writes:
>I appreciate your detailed analysis, it's certainly more detailed
>than what we have seen over the last few days.
>From my viewpoint, however, the situation is as follows:
>(1) The current model, both externally and internally,
>    follows what is described by the Camel Mk3.  As the pumpkin
>    I'm somewhat obligated to abide by that, at least that's the
>    first degree approximation.  (Incidentally, the reason I think
>    the Camel is so vague was that when it was written the Unicode
>    model was being ripped to shreds to be rebuilt, in a discussion
>    not unlike the one we are having.)
>(2) The basic Unicode support seems to be in a rather good shape now.
>    What I mean by "basic" is that as long as you don't start pulling your hair
>    over this very bytes vs UTF-8 vs characters issue, and just concatenate
>    strings, compare them, take their length, do regexes on them, etc, pretty
>    much everything seems to be working.
>Combine (1) and (2) and I see it as "what is broken, so what's there to
>fix" situation, let's call it (3).

Having reviewed things, tried a few, and judging by the mail topics
that recur, here is my stab at (3):

   We need to spell out somewhere the "logical" (perl visible)
   semantics of strings. We need to remove SvUTF8 and the other internal
   stuff from the docs as seen by the perl-level user, and move it to
   perlutf8guts or whatever.

   One true "bug":

   unpack('C',$str) != ord($str)   in some cases.
   Despite perlfunc saying 
    sub ordinal { unpack("c",$_[0]); } # same as ord()

   As far as I am aware this is the only remaining wart in the ASCII world.
   (I was not aware of it till this thread started as I am pack/unpack phobic.)

   (We can make it do what it does now in scope of 'use bytes' of course.)
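
   A minimal sketch of the character view versus the byte view, assuming a
   perl of 5.10 or later on an ASCII platform. It uses the 'W' unpack
   template, which did not exist when this was written, so it shows
   the distinction rather than the exact 2001 behaviour of 'C'.

```perl
use strict;
use warnings;

# One character above 0xFF, so it cannot be stored as a single octet.
my $str = chr(0x263A);

# Logical (perl visible) view: the code point of the first character.
my $ord = ord($str);

# 'W' unpacks one character value (assumes perl 5.10+).
my $w = unpack('W', $str);

# Byte view: inside 'use bytes', ord() sees the first octet of the
# internal UTF-8 representation instead of the character.
my $first_byte = do { use bytes; ord($str) };

printf "ord=0x%X  unpack W=0x%X  first byte=0x%X\n", $ord, $w, $first_byte;
```

   The first two values agree; the third is what a byte-oriented unpack
   would report.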

   Encode::* are less elegant than they could be, I at least have been
   adding things there without documenting them, some documented
   things are stubs. I suggest I draft a new 'pod', post it here, 
   get agreement and then implement it.
   (Anyone else is welcome to draft it if they get there first.)

   In essence where we have 
     $length = from_to($string,'Unicode','foo');
     $length = from_to($string,'foo','Unicode');

   we will have 

     $encoded = encode_as('foo',$string); 
     $string  = decode_from('foo',$encoded); 
   the trivial cases  

     $encoded = encode_as_utf8($string);     # encode_as('utf8',$string)
     $string  = decode_from_utf8($encoded); 

   will be special cased.
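
   To make the proposed shape concrete, here is a rough sketch. The names
   encode_as and decode_from are only the proposal above, not functions that
   Encode provides; the sketch assumes they would be thin wrappers over the
   encode() and decode() functions of a current Encode release.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrappers: these names are the proposal, not a real API.
sub encode_as {
    my ($encoding, $string) = @_;
    return Encode::encode($encoding, $string);    # characters -> octets
}

sub decode_from {
    my ($encoding, $octets) = @_;
    return Encode::decode($encoding, $octets);    # octets -> characters
}

my $string  = "caf" . chr(0xE9);                  # 4 characters
my $encoded = encode_as('UTF-8', $string);        # 5 octets
my $decoded = decode_from('UTF-8', $encoded);     # 4 characters again

printf "chars=%d octets=%d round-trip=%s\n",
    length($string), length($encoded),
    ($decoded eq $string ? "ok" : "FAILED");
```

   Returning a new value rather than converting in place is the point of
   the proposal: the caller always knows which variable holds characters
   and which holds octets.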

   The main issue is getting the names right so they "read" correctly
   and don't confuse people.

   EBCDIC world has legacy reasons why ord('A') != 0x41.
   We need a formal statement that on EBCDIC platforms we don't
   use Unicode codepoints but "HybriCode" where U+0000 .. U+00FF map to IBM-1047
   as defined in ext/Encode/Encode/cp1047.ucm, but that other code points
   map as themselves.    _Or_ whatever the formal definition is.

   That is ord('A') == 0xC1, chr(0xC1) eq 'A' 
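
   The mapping itself can be checked from an ASCII platform with the cp1047
   table. This is only a sketch of the table lookup, assuming the installed
   Encode ships the cp1047 encoding; it does not reproduce what ord() and
   chr() do natively on an EBCDIC build.

```perl
use strict;
use warnings;
use Encode ();

# Octet 0xC1 in IBM-1047 is the letter 'A'.
my $char  = Encode::decode('cp1047', "\xC1");

# And 'A' encodes back to the single octet 0xC1.
my $octet = ord(Encode::encode('cp1047', 'A'));

printf "decode(0xC1) = %s, encode('A') = 0x%X\n", $char, $octet;
```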

   We also need to define (at the internals level) whether utf8_upgrade('A') 
   produces "\x41",SvUTF8_on (as I think it should) or if it tries
   to apply the UTF-8 algorithm to the HybriCode code point, or if 
   it uses the UT?? thing intended for EBCDIC.

   We need some cook-book examples for Content-Length:, LDAP, 
   etc. which use new documented API  e.g. 

   my $encoded = encode_as('utf8',$string);
   print $MAIL "Content-Length: ",length($encoded),"\n\n",$encoded; 
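
   A runnable variant of the same idea, assuming a current Encode, whose
   encode_utf8() stands in for the proposed encode_as('utf8',...). The
   in-memory filehandle is only there so the sketch needs no mail transport.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# 7 characters, three of them outside ASCII.
my $string  = "na" . chr(0xEF) . "ve " . chr(0x263A);

# Encode first; the header must count octets, not characters.
my $encoded = encode_utf8($string);

# Write to an in-memory handle in place of a real mail pipe.
open(my $MAIL, '>', \my $message) or die "cannot open in-memory handle: $!";
print $MAIL "Content-Length: ", length($encoded), "\n\n", $encoded;
close($MAIL);

printf "characters=%d octets=%d\n", length($string), length($encoded);
```

   Taking length() of the unencoded string would under-report the body size
   whenever it contains non-ASCII characters.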

   We need to explain the risks of 'use bytes'.
   Personally I will never use it as it stands, but it is in the Camel...
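
   One risk can be sketched as follows, assuming an ASCII platform and a
   perl that provides utf8::upgrade: two strings that compare equal at the
   perl level report different lengths under 'use bytes', because the pragma
   exposes whichever internal representation each string happens to have.

```perl
use strict;
use warnings;

my $plain    = chr(0xE9);     # stored internally as one octet
my $upgraded = $plain;
utf8::upgrade($upgraded);     # same character, now stored as UTF-8

# Logically the two strings are identical.
my $same = ($plain eq $upgraded) ? 1 : 0;

my ($plain_bytes, $upgraded_bytes);
{
    use bytes;                # lexically scoped to this block
    $plain_bytes    = length($plain);
    $upgraded_bytes = length($upgraded);
}

print "equal=$same bytes(plain)=$plain_bytes bytes(upgraded)=$upgraded_bytes\n";
```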

Nick Ing-Simmons <>
Via, but not speaking for: Texas Instruments Ltd.