Re: The State of The Unicode

From: perl5-porters
Date: February 20, 2001 06:15
Subject: Re: The State of The Unicode
Message ID: 96tu7b$3g1$1@post.home.lunix

Maybe I'm just missing the point here, but having functions that
expose the internals seems to me completely the wrong way to
answer the question "how many octets is this UTF8 encoded string?".

The original string in the perl model ought to be just
a sequence of integers, and we need a function (it could be a
subfunction of unpack or whatever) that takes this
sequence, encodes it in UTF8, and returns a different sequence
of integers: the octets of the UTF8 encoding.

So to get the length in octets of a Unicode string you would
just do:

$string = "any string even containing high codepoints";
$encoded = toutf8($string);
print length $encoded;

and $encoded would otherwise be just a normal perl string,
which could itself be stored internally in any of the different ways.
Only the user would know that this new sequence of integers is
to be understood as a sequence of octets.
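
A minimal sketch of the idea above, assuming the Encode module is available. The name toutf8 comes from the example in this message; implementing it with Encode::encode is an assumption made for illustration, not something the message specifies.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper named after the toutf8() proposed above.
# It takes a string of code points and returns a new string whose
# characters are the octets of the UTF-8 encoding of the input.
sub toutf8 {
    my ($string) = @_;
    return encode('UTF-8', $string);
}

# Six characters: "c", "a", "f", U+00E9, a space, and U+263A.
my $string  = "caf\x{E9} \x{263A}";
my $encoded = toutf8($string);

print length($string),  "\n";   # prints 6 (characters)
print length($encoded), "\n";   # prints 9 (octets: 3 + 2 + 1 + 3)
```

Here $encoded is an ordinary string; nothing in it records that its integers are meant to be read as octets.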

For Camel compatibility you could have a number of functions in
use bytes, where bytes::length is just an alias for length(toutf8(@_)).
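
One way such a compatibility function could look is sketched below. The package name OctetCompat is made up for this example and is not the real bytes pragma; the use of Encode::encode is likewise an assumption.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical compatibility package (the name is invented for this sketch).
package OctetCompat;

# Octet length defined by explicitly encoding the string to UTF-8,
# rather than by inspecting perl's internal representation.
sub length {
    my ($string) = @_;
    return CORE::length(Encode::encode('UTF-8', $string));
}

package main;

print OctetCompat::length("\x{263A}"), "\n";   # prints 3
print OctetCompat::length("abc"),      "\n";   # prints 3
```

CORE::length is spelled out inside the package so that the call clearly refers to the builtin and not to the sub being defined.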

In short, UTF8 is just an encoding of the original sequence of
integers, but you should get hold of that by asking perl to
encode the sequence of integers for you, NOT by assuming that UTF8 is
their internal form and then exposing it. (The difference, of
course, is that what I describe above still works perfectly well
even if the perl internal form were UCS4 or whatever.)
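
The difference can be illustrated with a small sketch, assuming a perl that provides the Encode module and the builtin utf8::upgrade and utf8::downgrade functions. Two strings holding the same characters are forced into different internal forms: explicit encoding gives the same octet count for both, while peeking at the internals with bytes::length does not.

```perl
use strict;
use warnings;
use Encode qw(encode);
use bytes ();   # load bytes::length without enabling the pragma

my $narrow = "caf\x{E9}";
my $wide   = "caf\x{E9}";
utf8::downgrade($narrow);   # internal form: one byte per character
utf8::upgrade($wide);       # internal form: UTF-8

# The two strings are the same sequence of integers.
print $narrow eq $wide ? "equal\n" : "different\n";   # prints "equal"

# Asking for the encoding gives the same answer for both.
print length(encode('UTF-8', $narrow)), "\n";   # prints 5
print length(encode('UTF-8', $wide)),   "\n";   # prints 5

# Exposing the internal form gives answers that depend on storage.
print bytes::length($narrow), "\n";   # prints 4
print bytes::length($wide),   "\n";   # prints 5
```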
