
Perl-Unicode fundamentals (was Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2))

February 19, 2001 14:11
I have reasonable i18n background, and have been following the
Unicode issues on p5p for some time (though usually in arrears).
Now that I have caught up with the list, and also read the Unicode
section in Camel III, I would like to make a few comments.  If I may be
a wild-eyed optimist, I'd also like to take a shot at bridging some
of the gap between Jarkko and Ilya.  Let me note in preface that I
will be happy to participate in constructive discussion (and doc
clarification) as best I can.

To start, let me say that I am disappointed with the Unicode
treatment in Camel III.  I was expecting it, for one, to elucidate
the motivation behind the shadowy "use bytes" I'd heard about on
p5p, but found it added nothing at all.  There were no complete
examples of where it may be useful.  The notion that a
"byte-oriented" program would be more likely (than a
"character-oriented" program) to care about the utf8 representation
of a string struck me as bassackwards.  It suggests that a binary
byte array got marked somehow as utf8; in that case, the correct
approach is to prevent that.  If anyone can give me a complete
example of "use bytes", I would appreciate it.

Let me go on to the exchange between Jarkko and Ilya.

On Thu, Feb 15, 2001 at 10:48:33PM -0600, Jarkko Hietaniemi wrote:
> Ilya:
> > I found no documentation in perlop!  
> > 
> >              Like the qq manpage but generates Unicode for
> >              characters whose code points are greater than 128,
> >              or 0x80.
> > 
> > "Generates Unicode"?  What does it mean?  "Generates bytes"?  How do I
> My bad.  It should say "generates UTF-8".

If the internal representation of a string is hidden, this statement
is meaningless.  Literally.  To my understanding, a logical Perl
string is always a list of (possibly large) positive integers, which
are usually taken to represent Unicode code points, but sometimes
they are just numbers.  This is true regardless of input method or
internal representation.
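
A minimal sketch of this "list of integers" model, assuming a later
perl (5.8 or newer) rather than the 5.7.0 snapshot under discussion.
The variable names are illustrative only.

```perl
use strict;
use warnings;

# Build a string from three code points: U+0041, U+00E9, U+263A.
my $s = "A\x{e9}\x{263A}";

# Logical view: one integer per character, whatever the storage is.
my @code_points = map { ord($_) } split //, $s;
my $n_chars     = length $s;

printf "%d characters: %s\n",
    $n_chars, join ' ', map { sprintf 'U+%04X', $_ } @code_points;
```

Nothing in the sketch asks how the string is stored; length and ord
answer in characters and code points only.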

Conversely, if the phrase "generates UTF-8" has meaning, something
is broken in Perl or my understanding is deeply flawed.  (I hope to
find out which!)

By reading the surrounding context, I think that you are really
talking about /interpretation as/ Unicode, not /generation of/
Unicode.  This is getting closer to meaningful, but still isn't
there.  Note, I am not being pedantic; I really can't understand
what qu// means without going through this thought process.

> You have misunderstood the intent, then.  qu// is NOT supposed
> to be transparent, it's exactly meant to be used where someone
> wants purposefully to BREAK the transparency and knowingly generate
> UTF-8 even for the 0x80...0xff range.  Think of it as "qq explicitly
> generating UTF-8".

I'm a little confused as to whether the distinction between qq// and
qu// applies to the literal characters in the script, or only to the
\xHH and \x{HHH...} escapes.  I assume and hope it is only the
latter, but I'm not sure (I have an additional tirade prepared if
not :-) ).  Please clarify.

> Let me take you through the old Unicode model (5.6) and the current one,
> and the old non-Unicode, and the various ways to produce 'characters':
> 	B means a byte, BW a modulo 256-wrapped byte,
> 	C means an UTF-8 encoded multibyte character.

More terminology problems:  What is a "byte" in a Perl string?  What
is UTF-8?  You must either expand the definition of a Perl string
beyond what I stated above, or you must always speak in terms of
Unicode code points.  I think you mean,

    B means that the given number is interpreted as a code point in
      the local default coded character set, then converted
      (logically--not necessarily in storage!) to the corresponding
      Unicode code point,
    BW means the above except that the number is first truncated mod
      256,
    C means that the given number is (logically) simply taken as is.


Also, you confuse logical interpretation and internal
representation in places.  Internal representation _cannot_ matter,
or else Unicode support is IMO too confusing to use.  The table
below should only address logical interpretation.

Also, I'm not familiar with Perl's traditional locale support, but I
take it that, for all practical purposes, the default encoding only
has characters at positions 0x00-0xff, and that 0x00-0x7f represent
the same characters as Unicode 0x00-0x7F.  EBCDIC and JPerl (?) are
beyond the scope of this discussion.

> 				<5.6	5.6	>5.6
> 	qq \x00..\x7f		B	B	B

Note that, with the clarified definition (plus assumptions) above,
all the entries for 0x00-0x7f are equally "B" and "C".

> 	qq \x80..\xff		B	C	B
> 	qq \x{100}...		-	C	C
> 	v0..v127		-	B	B
> 	v128..v255		-	B	B
> 	v256...			-	C	C

This can't be right if v is to be used for version comparisons.  If
v128 and v129 are interpreted as "bytes" in the local encoding,
then because of (logical) conversion to Unicode code points, it may
turn out that v128 > v129.  Again, this is irrespective of storage.

Further, it is very ugly for a new construct to be saddled with
legacy inconsistencies.
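
The ordering requirement on v-strings can be sketched as follows.
This assumes a later perl (5.8 or newer) in which the ordinals in a
v-string are compared as plain code points, with no locale in effect;
it is an illustration of the desired behaviour, not a statement about
5.7.0.

```perl
use strict;
use warnings;

# A v-string is a string whose characters are the listed ordinals.
my $older = v1.128.0;
my $newer = v1.129.0;

# String comparison proceeds ordinal by ordinal, so 128 sorts before 129.
my $order = $older cmp $newer;

# The same must hold across the 255/256 boundary.
my $boundary = v1.255 cmp v1.256;

printf "%vd cmp %vd => %d\n", $older, $newer, $order;
```

If 128 and 129 were first mapped through a local 8-bit encoding, neither
comparison would be guaranteed to come out negative.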

> 	qu \x00..\x7f		-	-	B
> 	qu \x80..\xff		-	-	C
> 	qu \x{100}...		-	-	C
> 	chr(0x00..0x7f)		B	B	B
> 	chr(0x80..0xff)		B	B	B
> 	chr(0x100...)		BW	C	C

I understand the case for crippling qq// and chr (and ord) for
compatibility with legacy code.  But it seems to scream for control
by a pragma, instead of adding a new function in place of qq// and
offering no sane replacement for chr.  Treating Unicode code points
0x80-0xff as second-class citizens is a real smack in the face to
the i18n effort.  (They will probably be your most common characters
after ASCII!)

> 	pack("C",0x00..0x7f)	B	B	B
> 	pack("C",0x80..0xff)	B	B	B
> 	pack("C",0x100...)	BW	BW	BW

A pragma to make the last one an error would be nice.
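
One way such a pragma could look, sketched under the assumption of a
later perl in which the mod-256 wrap raises a warning in the 'pack'
category.  The helper name pack_byte_strict is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the packed byte, or undef if the value
# would have been wrapped mod 256.
sub pack_byte_strict {
    my ($n) = @_;
    # Assumption: the wrap is reported as a 'pack' warning, which this
    # lexical pragma promotes to a fatal error.
    use warnings FATAL => 'pack';
    my $packed = eval { pack 'C', $n };
    return $packed;
}

print defined pack_byte_strict(0x41)  ? "0x41 packs\n"  : "0x41 rejected\n";
print defined pack_byte_strict(0x100) ? "0x100 packs\n" : "0x100 rejected\n";
```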

> 	pack("U",0x00..0x7f)	-	B	B
> 	pack("U",0x80..0xff)	-	B	C
> 	pack("U",0x100...)	-	C	C

Note that since the last two are for producing byte arrays (usually
to give to something external to perl), not strings for
character-oriented manipulation within perl, they aren't very
interesting here.

> From the table you can see certain patterns, though.

They are dubious patterns.  The last two are just for creating
binary structures, and make sense entirely independent of core
Unicode support.  Of the remaining four, three have the ugly BBC (==
CBC, as I mentioned) for legacy compatibility.  qu// has a BCC (==
CCC) analog, but chr has no clean analog, and v should never have
been sullied, because it was introduced with core Unicode support.

> And the mantra "Unicode should be transparent" doesn't really help
> here.  It simply isn't.  We can take apart (unpack("C")) an UTF-8
> encoded character and see its guts, the bytes of UTF-8 encoding.

God, I hope that's not true.  Please say you were mistaken.

> But that's not the ugliest part, I think I/O is the ugliest part:

Yes, I/O is tricky, but is entirely its own module, and should not
affect the model.
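
A sketch of keeping encodings at the I/O boundary, assuming the Encode
module that ships with later perls (5.8 and up); it was not part of
the perl under discussion here.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Inside the program: a string of 7 characters (code points).
my $text = "na\x{ef}ve \x{263A}";

# At the boundary: explicitly turn characters into a UTF-8 byte array.
my $bytes = encode('UTF-8', $text);

# Coming back in: explicitly turn the byte array into characters again.
my $back = decode('UTF-8', $bytes);

printf "%d characters, %d bytes on the wire\n", length $text, length $bytes;
```

The byte array exists only because the program asked for it by name; the
character string never has to reveal how it is stored.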

> Yes, \x{} was supposed to be the way to "produce UTF-8", always,
> always including the 0x80..0xff range.  But at least my gut feeling
> is that the \x{} is not 'different enough' to warrant that different
> semantics in that 8-bit range.  I find it too sneaky that qq \xHH
> and qq \x{HH} would produce different strings, bytewise.  Yes, they
> should _compare_ the same -- and they do, that's the transparency part.

This difference can't be.  IOW, the user should have no direct
control over how things are represented internally; moreover, he
shouldn't want it.  Things that compare the same are (logically) the
same.  Perl should be able to change the representation as it sees
fit, at any time.  Otherwise, I fall back on my stance that Unicode
support is too confusing to use.
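
A sketch of "compare the same, are the same", assuming a later perl
(5.8.1 or newer) where utf8::upgrade and utf8::is_utf8 exist.  Peeking
at the storage flag is done here only to show that it is the sole
visible difference; ordinary code should not need it.

```perl
use strict;
use warnings;

my $from_hex   = "caf\xe9";      # U+00E9 written as \xHH
my $from_brace = "caf\x{e9}";    # U+00E9 written as \x{HH}

# Force the second copy into UTF-8 storage internally.
utf8::upgrade($from_brace);

my $equal       = $from_hex eq $from_brace ? 1 : 0;
my $same_length = length($from_hex) == length($from_brace) ? 1 : 0;

# Only this introspection call can tell the two apart.
my $storage_differs =
    (utf8::is_utf8($from_hex) xor utf8::is_utf8($from_brace)) ? 1 : 0;

print "equal=$equal same_length=$same_length ",
      "storage_differs=$storage_differs\n";
```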

(This statement of yours means that I was probably mistaken in
several places above concerning what I thought you meant.  But it
was what you should have meant. :-) )

If you don't agree, please give me an example where you would care
about producing "different strings, bytewise".

Sorry if I am critical of the current work, but with feedback, I
hope I can help improve things.

Andrew