Re: The State of The Unicode

February 19, 2001 17:18
Message ID:
On Tue, Feb 20, 2001 at 01:11:12AM +0100, wrote:
> On Mon, Feb 19, 2001 at 06:07:14PM -0500, Andrew Pimlott wrote:
> > On Mon, Feb 19, 2001 at 04:47:53PM -0600, Jarkko Hietaniemi wrote:
> > > 
> > > As far as "what is broken" goes, I do understand the concern of
> > > "exposing too much of the internal representation" (which at the
> > > moment happens to be UTF-8) to the user; having bytes and characters
> > > is confusing at best.  However, I'm not fully convinced that completely hiding it is
> > > wise, either.  If from Perl level one cannot reach back to the bytes
> > > comprising the UTF-8 representation of the characters, I feel we are
> > > trying to pad the cell too softly.
> > 
> > My kingdom for one example.
> If you step out of the box, it's easy to come up with examples.

If you have time, can you make your examples more concrete, for
instance with code snippets?  I have the idea that you're getting at
something, but I can't quite figure out what.

> Whenever you need to interface with something that has no understanding
> of Unicode, for which everything is data, you want to be able to look
> bytewise at your strings. When talking to a serial device for instance,
> or a hard disk, whose capacity will be measured in bytes, not variable
> width characters. Device drivers might not be commonly written in Perl,
> but that doesn't mean it should be impossible.

But these bytes will not be treated as the bytes of a utf8 encoded
Perl string.  Do you mean that you would require perl to represent
your byte strings as it does now, and to throw an error if it ever
felt the need to represent them in utf8?

> But you don't have to go that low level. uuencode & base64 work with 8-bit
> bytes. Taking your Unicode string, looking at it as bytes, uuencoding it,
> sending it, receiving it, uudecoding it and looking at it again as Unicode
> will work - as long as you can get to the byte representation.

Interesting example (cf. also the recent i18n of DNS); here are my
questions.  What if perl were currently representing your string in
your local encoding?  How would you coerce it to utf8?  Would you
expect coercion to happen automatically inside a block with some
pragma?  Would there be an explicit function?  Would there be a
magic incantation like

    $str .= "\x{ffff}"; chop $str;

?  Would you accept a method that would fail if perl ever decided to
use UCS-4 as its wide string internal representation?

I tend to find it saner to do something like

    my $utf8str = to_utf8 $str;
    foreach my $byte (split //, $utf8str) { ... }

i.e., so that the Perl characters in $utf8str are not the same
characters as in $str; they are just byte values.  Is your concern
merely efficiency, or am I missing something?
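
A minimal sketch of that idea, assuming a perl that provides the Encode
module (an assumption; nothing in this thread says which interface perl
will offer).  The to_utf8 below is a hypothetical stand-in that simply
wraps Encode::encode, and the sample string is made up.  The point is only
that the result is an ordinary string whose characters happen to be the
UTF-8 byte values, whatever perl uses internally:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical stand-in for the to_utf8 named above: returns a new string
# in which every character is one byte of the UTF-8 encoding of $str.
sub to_utf8 {
    my ($str) = @_;
    return encode('UTF-8', $str);
}

my $str     = "caf\x{e9} \x{263a}";    # 6 characters, one of them > 0xff
my $utf8str = to_utf8($str);

# Walk the byte values one at a time, as in the loop above.
my @bytes = map { ord } split //, $utf8str;

printf "%d characters, %d bytes\n", length($str), scalar @bytes;

# Going the other way recovers the original characters.
my $back = decode('UTF-8', $utf8str);
print $back eq $str ? "round trip ok\n" : "round trip FAILED\n";
```

Nothing here depends on how perl stores $str internally, so it would keep
working if the wide representation changed.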

> A lot of existing compression and encryption software just look at the
> data to be compressed or encrypted as bit or byte streams. There is no
> reason to create Unicode aware versions of those tools before they can
> be used on Unicode data. But to create Perl programs that compress or
> encrypt data that can be decompressed or decrypted with the existing
> tools, your Perl program needs to be able to look at the data as a
> sequence of bytes.

The big question (to me) is, why would perl ever think of these
strings as utf8 encoded in the first place?  A rephrasing:  You can
obviously write these programs with non-Unicode perl.  What do you
expect could change in Unicode-enabled perl that could break them?
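
As a rough illustration of the kind of program in question, here is a
sketch that pushes arbitrary byte values through a byte-oriented encoder
and back.  It assumes the MIME::Base64 module is available, and the sample
data is made up; every character value in the string is <= 0xff, so utf8
never enters into it:

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Five arbitrary byte values, including some above 0x7f.
my $data = pack 'C*', 0x00, 0x7f, 0x80, 0xe9, 0xff;

# Encode for transport (empty second argument: no line breaks), then decode.
my $wire = encode_base64($data, '');
my $back = decode_base64($wire);

print length($data), " bytes in, ", length($back), " bytes out\n";
print $back eq $data ? "round trip ok\n" : "round trip FAILED\n";
```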

I absolutely expect Unicode-enabled perl to work well with bytes.  I
expect that it would have an internal representation that would hold
strings in which each character value is <= 0xff efficiently.  What
does this have to do with getting the bytes out of a possible other

