perl.perl5.porters

Re: The State of The Unicode

February 19, 2001 17:42
On Mon, Feb 19, 2001 at 08:18:07PM -0500, Andrew Pimlott wrote:
> On Tue, Feb 20, 2001 at 01:11:12AM +0100, wrote:
> > On Mon, Feb 19, 2001 at 06:07:14PM -0500, Andrew Pimlott wrote:
> > > On Mon, Feb 19, 2001 at 04:47:53PM -0600, Jarkko Hietaniemi wrote:
> > > > 
> > > > As far as "what is broken" goes, I do understand the concern of
> > > > "exposing too much of the internal representation" (which at the
> > > > moment happens to be UTF-8) to the user; having both bytes and
> > > > characters is confusing at best.  However, I'm not fully convinced
> > > > that completely hiding it is wise, either.  If from Perl level one
> > > > cannot reach back to the bytes comprising the UTF-8 representation
> > > > of the characters, I feel we are trying to pad the cell too softly.
> > > 
> > > My kingdom for one example.
> > 
> > If you step out of the box, it's easy to come up with examples.
> If you have time, can you be more concrete about your examples, for
> example with code snippets?  I have the idea that you're getting at
> something, but I can't quite figure out what.

It's 2:25 AM now. Too late for code snippets.

> > Whenever you need to interface with something that has no understanding
> > of Unicode, for which everything is data, you want to be able to look
> > at your strings bytewise. When talking to a serial device for instance,
> > or a hard disk, whose capacity will be measured in bytes, not variable
> > width characters. Device drivers might not be commonly written in Perl,
> > but that doesn't mean it should be impossible.
> But these bytes will not be treated as the bytes of a utf8 encoded
> Perl string.  Do you mean that you will require that perl represent
> your byte strings as now, and throw an error if it ever felt the
> need to represent them in utf8?

What do you mean? Jarkko writes "If from Perl level one cannot reach back
to the bytes comprising the UTF-8 representation of the characters, I
feel we are trying to pad the cell too softly." You reply with "My kingdom
for one example." and, given my examples, you claim they are not about
getting the bytes.

The examples are given to indicate *why* you *want* to get the bytes.
No conversion. Just the bytes, ma'am.

> > But you don't have to go that low level. uuencode & base64 work with 8-bit
> > bytes. Taking your Unicode string, looking at it as bytes, uuencoding it,
> > sending it, receiving it, uudecoding it and looking at it again as Unicode
> > will work - as long as you can get to the byte representation.
> Interesting example (cf as well the recent i18n of DNS); here are my
> questions.  What if perl were currently representing your string in
> your local encoding?  How would you coerce it to utf8?  Would you
> expect coercion to happen automatically inside a block with some
> pragma?  Would there be an explicit function?  Would there be a
> magic incantation like
>     $str .= "\x{ffff}"; chop $str;
> ?  Would you accept a method that would fail if perl ever decided to
> use UCS-4 as its wide string internal representation?


Think of programs exceeding 10 lines as split into logical units.
Modules, subroutines, whatever. One part of your program takes a Unicode
string, in say UTF-8, but who knows, some later version of Perl does
UTF-16 as well. The string is manipulated - as a Unicode string - using
normal string operations: concatenation, substitutions, chop, substr,
etc. Done with your manipulation, you want to compress it, and then
uuencode it because you need to email it to your boss.  The compression
and uuencode parts of your program (CPAN modules?)  don't care whether
they get UTF-8, UTF-16 or binary data. They act on bytes, and their
counterparts on the other end, legacy code dating from 1987, haven't
even heard of Unicode, so not acting on bytes isn't an option.

Unicode is just a protocol, and so is its encoding. Compression and
uuencoding are protocols as well, working on a different level and
totally orthogonal to Unicode.
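
The layering described above can be sketched roughly as follows. This is
only an illustration under assumptions: it uses the Encode module (which
ships with later versions of Perl and is not mentioned in this thread) to
pick an explicit encoding, and Perl's built-in pack 'u' as the uuencode
step. A compression step would slot in at the same place, between
encoding and uuencoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode);   # assumption: Encode is available

# Character-level work: ordinary string operations on a Unicode string.
my $text = "na\x{EF}ve caf\x{E9} \x{263A}";   # U+263A takes 3 bytes in UTF-8
$text =~ s/caf\x{E9}/bistro/;                 # substitution acts on characters
$text .= "!";

# Byte-level work: choose an encoding, then hand plain bytes to byte-only code.
my $bytes   = encode('UTF-8', $text);
my $encoded = pack('u', $bytes);              # uuencode knows nothing of Unicode

# The receiving side undoes the layers in the opposite order.
my $raw      = unpack('u', $encoded);
my $received = decode('UTF-8', $raw);

print $received eq $text ? "round trip ok\n" : "round trip failed\n";
```

The uuencode layer never needs to know which encoding was chosen; only
the two ends that treat the data as text have to agree on it.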

> I tend to find it saner to do something like
>     my $utf8str = to_utf8 $str;
>     foreach my $byte (split /.*/, $utf8str) { ... }
> ie, so that the Perl characters in $utf8str are not the same
> characters as in $str, they are just byte values.  Is your concern
> merely efficiency, or am I missing something?

Your code fragment is unclear to me. Are you extracting the newlines
out of $utf8str? Because that isn't code that extracts separate bytes.
Regexes are supposed to work on characters, not bytes, so use of split
would not work.

I don't know how to extract the bytes from a UTF-8 string. But that's
not what we're discussing, we are discussing *why* you would want it.
Which function to use is a detail, relevant only once we have settled
the why.
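
For what it's worth, one possible way to get at the bytes, sketched under
the assumption that the Encode module from later Perls is available (it
is not something this thread settles on): ask for the UTF-8 encoding
explicitly and unpack the result, rather than peeking at whatever perl
happens to use internally.

```perl
use strict;
use warnings;
use Encode qw(encode);   # assumption: Encode is available

my $str = "\x{20AC}5";               # EURO SIGN followed by "5": two characters

# Request the UTF-8 encoding explicitly; the result is a string of bytes.
my $utf8  = encode('UTF-8', $str);
my @bytes = unpack('C*', $utf8);     # one integer per byte

# Prints: 2 characters, 4 bytes: E2 82 AC 35
printf "%d characters, %d bytes: %s\n",
    length($str), scalar(@bytes),
    join(' ', map { sprintf '%02X', $_ } @bytes);
```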

> > A lot of existing compression and encryption software just look at the
> > data to be compressed or encrypted as bit or byte streams. There is no
> > reason to create Unicode aware versions of those tools before they can
> > be used on Unicode data. But to create Perl programs that compress or
> > encrypt data that can be decompressed or decrypted with the existing
> > tools, your Perl program needs to be able to look at the data as a
> > sequence of bytes.
> The big question (to me) is, why would perl ever think of these
> strings as utf8 encoded in the first place?

Because you told Perl to. Or you let Perl read something that's UTF-8.

>                                              A rephrasing:  You can
> obviously write these programs with non-Unicode perl.  What do you
> expect could change in Unicode-enabled perl that could break them?

When it comes to a standalone compression or encryption program, yes.
But what if you don't communicate using a pipe between two programs, but
use a compression or encryption module in a string manipulation program,
and you use method calls to communicate? One part of your program needs
to treat the string as being Unicode, the other part as bytes.
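
A rough sketch of such a boundary inside one program. Both subroutines
are hypothetical stand-ins: headline() plays the character-oriented part,
and byte_checksum() plays the compression or encryption module that only
understands bytes. The Encode module is again an assumption, not
something from this thread.

```perl
use strict;
use warnings;
use Encode qw(encode);   # assumption: Encode is available

# Hypothetical byte-oriented routine, standing in for a compression or
# encryption module: a 16-bit sum over the bytes it is given.
sub byte_checksum {
    my ($bytes) = @_;
    return unpack('%16C*', $bytes);
}

# Hypothetical character-oriented routine: keeps the first N characters.
sub headline {
    my ($text, $max_chars) = @_;
    return substr($text, 0, $max_chars);
}

# Greek alpha, beta, gamma, followed by ASCII text.
my $title = headline("\x{3B1}\x{3B2}\x{3B3} and the rest", 3);

# The call boundary is where characters become bytes.
my $sum = byte_checksum(encode('UTF-8', $title));

printf "%d characters, checksum %d\n", length($title), $sum;
```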

> I absolutely expect Unicode-enabled perl to work well with bytes.  I
> expect that it would have an internal representation that would hold
> strings in which each character value is <= 0xff efficiently.  What
> does this have to do with getting the bytes out of a possible other
> representation?

I gave you three examples of why you want to have the bytes.
If you wonder what those examples have to do with getting the bytes,
why bother asking for examples?

