
Re: The State of The Unicode

From: andrew
Date: February 19, 2001 18:42
Subject: Re: The State of The Unicode
Message ID: 20010219214125.N17705@pimlott.ne.mediaone.net

On Tue, Feb 20, 2001 at 02:44:56AM +0100, abigail@foad.org wrote:
> On Mon, Feb 19, 2001 at 08:18:07PM -0500, Andrew Pimlott wrote:
> > On Tue, Feb 20, 2001 at 01:11:12AM +0100, abigail@foad.org wrote:
> > > Whenever you need to interface with something that has no understanding
> > > of Unicode, for which everything is data, you want to be able to look
> > > bytewise at your strings. When talking to a serial device for instance,
> > > or a hard disk, whose capacity will be measured in bytes, not variable
> > > width characters. Device drivers might not be commonly written in Perl,
> > > but that doesn't mean it should be impossible.
> > 
> > But these bytes will not be treated as the bytes of a utf8 encoded
> > Perl string.  Do you mean that you will require that perl represent
> > your byte strings as now, and throw an error if it ever felt the
> > need to represent them in utf8?
> 
> What do you mean? Jarkko writes "If from Perl level one cannot reach back
> to the bytes comprising the UTF-8 representation of the characters, I
> feel we are trying to pad the cell too softly." You reply with "My kingdom
> for one example." and on my examples you claim it's not getting the bytes.
> 
> The example is given to indicate *why* you would *want* to get the bytes.
> No conversion. Just the bytes, ma'am.

No, I'm claiming that it's not getting bytes from a UTF-8
representation.  How did it get a UTF-8 representation?  (I see your
answer below, and will discuss it there.)

I'm saying, in your example, it's getting bytes from bytes.  Just
like in today's perl.  No mystery there.

> Think of programs exceeding 10 lines as split into logical units.
> Modules, subroutines, whatever. One part of your program takes a Unicode
> string, in say UTF-8, but who knows, some later version of Perl does
> UTF-16 as well. The string is manipulated - as a Unicode string - using
> normal string operations: concatenation, substitutions, chop, substr,
> etc. Done with your manipulation, you want to compress it, and then
> uuencode it because you need to email it to your boss.  The compression
> and uuencode parts of your program (CPAN modules?)  don't care whether
> they get UTF-8, UTF-16 or binary data. They act on bytes, and their
> counterparts on the other end, legacy code dating from 1987, haven't
> even heard of Unicode, so not acting on bytes isn't an option.

Ok, you want to send it to your boss.  You must know at that point
that you want UTF-8.  Wouldn't it make the most sense to explicitly
call a function that says "convert to UTF-8" before passing the
string to the compression module?  Instead of saying, "operate on
the internal representation"?  That's my whole point.
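
A rough sketch of what that explicit step could look like, assuming a
later perl that ships the Encode module (it was not available when this
was written). Encode::encode stands in for the "convert to UTF-8"
function, and pack's 'u' template stands in for the byte-oriented
uuencode stage; $text, $octets, $uu and $back are illustrative names only:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A character string containing U+263A, which needs three bytes in UTF-8.
my $text = "report \x{263a}";

# Choose the wire encoding explicitly, at the boundary of the program.
my $octets = encode('UTF-8', $text);

# Byte-oriented step: uuencode via pack's 'u' template (core Perl).
# It only ever sees octets, never the internal representation.
my $uu = pack('u', $octets);

# The receiving side reverses both steps, again naming the encoding.
my $back = decode('UTF-8', unpack('u', $uu));

print length($text), " characters, ", length($octets), " octets\n";
print $back eq $text ? "round trip ok\n" : "round trip failed\n";
```

The byte-oriented stages never need to know how perl stores $text; they
are handed octets in an encoding that was named in the code.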

> > I tend to find it saner to do something like
> > 
> >     my $utf8str = to_utf8 $str;
> >     foreach my $byte (split /.*/, $utf8str) { ... }
> > 
> > ie, so that the Perl characters in $utf8str are not the same
> > characters as in $str, they are just byte values.  Is your concern
> > merely efficiency, or am I missing something?
> 
> Your code fragment is unclear to me. Are you extracting the newlines
> out of $utf8str? Cause that isn't code that extracts separate bytes.
> Regexes are supposed to work on characters, not bytes, so use of split
> would not work.

Sorry, I intended that

    - $str is a normal string of Unicode characters.
      - substr($str, 0, 1) is 0xc0
    - $utf8str is a string of bytes, where the bytes are the UTF-8
      encoding of $str.
      - substr($utf8str, 0, 1) is 0xc3
      - substr($utf8str, 1, 1) is 0x80
    - the loop iterates over every "Perl character" in $utf8str,
      ie, over every byte in the UTF-8 representation.  (I meant the
      empty re, //, which matches everywhere; sorry for the confusion.)
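
As a minimal sketch of the same idea, assuming a later perl with the
Encode module (not available at the time of this thread): to_utf8 is a
hypothetical name, and Encode::encode is used here in its place.

```perl
use strict;
use warnings;
use Encode qw(encode);

# $str holds one Unicode character, U+00C0.
my $str = "\x{c0}";

# Explicitly ask for the UTF-8 encoding. Each "Perl character" of the
# result is one byte of that encoding. Encode::encode stands in for the
# hypothetical to_utf8.
my $utf8str = encode('UTF-8', $str);

# split // matches between every character, so this walks the bytes.
my @bytes = map { ord } split //, $utf8str;

printf "%d character(s), %d byte(s): %s\n",
    length($str), length($utf8str),
    join(' ', map { sprintf '0x%02x', $_ } @bytes);
# prints: 1 character(s), 2 byte(s): 0xc3 0x80
```

Nothing here inspects how perl stores $str internally; the bytes come
from a string that was explicitly encoded.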

> I don't know how to extract the bytes from a UTF-8 string. But that's
> not what we're discussing, we are discussing *why* you would want it.
> Which function to use is a detail, only having relevancy if we cleared
> the why issue.

No, there is no discussion over why you would want bytes from a UTF-8
string.  Just like there's no discussion over why you would want
bytes from an ISO-8859-1 string, or an EUC-JP string.  "Which
function to use" is all I'm discussing.

It's important, because I'm asking whether you're doing a nice,
clean, explicit "give me the UTF-8 representation of this string",
in which each "Perl character" is a byte (the same way people deal
with UTF-8 in Perl today); or whether you're saying, "give me
whatever the internal representation happens to be (this version of
Perl, this architecture, this locale, this moment)"; or "coerce the
internal representation to UTF-8 now"; or whether we want to support
an interface that handles only UTF-8 and the locale-native
encoding, when there's a great variety of character encodings out
there; or to tie perl to UTF-8 forever; or to create an interface
that will inevitably be incompatible with non-Unicode perl.

> I gave you three examples of why you want to have the bytes.
> If you wonder what those examples have to do with getting the bytes,
> why bother asking for examples?

Is the above clearer?

In short, "give me the bytes" is an awful interface, except for
debugging or guts-level code.  "give me UTF-8" makes perfect sense.

Andrew


