develooper Front page | perl.perl5.porters | Postings from February 2001

Perl-Unicode fundamentals (was Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2))

From:
Nick Ing-Simmons
Date:
February 20, 2001 02:46
Subject:
Perl-Unicode fundamentals (was Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2))
Message ID:
200102201045.KAA23362@mikado.tiuk.ti.com
Andrew Pimlott <andrew@pimlott.ne.mediaone.net> writes:
>
>To start, let me say that I am disappointed with the Unicode
>treatment in Camel III.  I was expecting it, for one, to elucidate
>the motivation behind the shadowy "use bytes" I'd heard about on
>p5p, but found it added nothing at all.  There were no complete
>examples of where it may be useful.  The notion that a
>"byte-oriented" program would be more likely (than a
>"character-oriented" program) to care about the utf8 representation
>of a string struck me as bassackwards.  It suggests that a binary
>byte array got marked somehow as utf8; in that case, the correct
>approach is to prevent that.  If anyone can give me a complete
>example of "use bytes", I would appreciate it.
>
>Let me go on to the exchange between Jarkko and Ilya.
>
>On Thu, Feb 15, 2001 at 10:48:33PM -0600, Jarkko Hietaniemi wrote:
>> Ilya:
>> > I found no documentation in perlop!  
>> > 
>> >              Like the qq manpage but generates Unicode for
>> >              characters whose code points are greater than 128,
>> >              or 0x80.
>> > 
>> > "Generates Unicode"?  What does it mean?  "Generates bytes"?  How do I
>> 
>> My bad.  It should say "generates UTF-8".
>
>If the internal representation of a string is hidden, this statement
>is meaningless.  Literally.  

Agreed. But the representation is not hidden from the C code that makes up
the perl core. And (as I understand it) 'use bytes' explicitly
exposes the representation to perl.

>To my understanding, a logical Perl
>string is always a list of (possibly large) positive integers, which
>are usually taken to represent Unicode code points, but sometimes
>they are just numbers.  This is true regardless of input method or
>internal representation.

I agree so far.

>
>Conversely, if the phrase "generates UTF-8" has meaning, something
>is broken in Perl or my understanding is deeply flawed.  (I hope to
>find out which!)
>
>By reading the surrounding context, I think that you are really
>talking about /interpretation as/ Unicode, not /generation of/
>Unicode.  This is getting closer to meaningful, but still isn't
>there.  Note, I am not being pedantic; I really can't understand
>what qu// means without going through this thought process.
>
>> You have misunderstood the intent, then.  qu// is NOT supposed
>> to be transparent, it's exactly meant to be used where someone
>> wants purposefully to BREAK the transparency and knowingly generate
>> UTF-8 even for the 0x80...0xff range.  Think of it as "qq explicitly
>> generating UTF-8".
>
>I'm a little confused as to whether the distinction between qq// and
>qu// applies to the literal characters in the script, or only to the
>\xHH and \x{HHH...} escapes.  I assume and hope it is only the
>latter, but I'm not sure (I have an additional tirade prepared if
>not :-) ).  Please clarify.

Without going and looking at the code that implements qu I cannot
say for certain. 

>
>> Let me take you through the old Unicode model (5.6) and the current one,
>> and the old non-Unicode, and the various ways to produce 'characters':
>> 
>> 	B means a byte, BW a modulo 256-wrapped byte,
>> 	C means an UTF-8 encoded multibyte character.
>
>More terminology problems:  What is a "byte" in a Perl string?  

The discussion was/is, I think, about representation again.
Thus:
   "byte" (B)  meant a literal value 0..255, stored as a single byte
          (BW) meant a value > 255 stored wrapped, as (value & 0xFF)
   C meant a sequence of UTF-8 encoded values, with the SvUTF8 bit turned on.

>What
>is UTF-8?  You must either expand the definition of a Perl string
>beyond what I stated above, or you must always speak in terms of
>Unicode code points.  I think you mean,
>
>    B means that the given number is interpreted as a code point in
>      the local default coded character set, then converted
>      (logically--not necessarily in storage!) to the corresponding
>      Unicode code point,
>    BW means the above except that the number is first truncated mod
>       256,
>    C means that the given number is (logically) simply taken as is.
>
>Yes?
>
>Also, you confuse logical interpretation, and internal
>representation, in places.  Internal representation _cannot_ matter,
>or else Unicode support is IMO too confusing to use.  The table
>below should only address logical interpretation.

I think that (EBCDIC aside) we are all agreed on the logical interpretation.

The discussion is about how the internals achieve that, with an
eye on efficiency, and also perhaps (for some people at least) to
allow 'use bytes' to have some predictability.

>
>Also, I'm not familiar with Perl's traditional locale support, but I
>take it that, for all practical purposes, the default encoding only
>has characters at positions 0x00-0xff, and that 0x00-0x7f represent
>the same characters as Unicode 0x00-0x7F.  EBCDIC and JPerl (?) are
>beyond the scope of this discussion.

EBCDIC _has_ to be included in the discussion.


>
>> 				<5.6	5.6	>5.6
>> 
>> 	qq \x00..\x7f		B	B	B
>
>Note that, with the clarified definition (plus assumptions) above,
>all the entries for 0x00-0x7f are equally "B" and "C".
>
>> 	qq \x80..\xff		B	C	B
>> 	qq \x{100}...		-	C	C
>>
>> 	v0..v127		-	B	B
>> 	v128..v255		-	B	B
>> 	v256...			-	C	C
>
>This can't be right if v is to be used for version comparisons.  If
>v128 and v129 are interpreted as "bytes" in the local encoding,
>then because of (logical) conversion to Unicode code points, it may
>turn out that v128 > v129.  Again, this is irrespective of storage.

The discussion was about how perl stores the code points internally.

>
>Further, it is very ugly for a new construct to be saddled with
>legacy inconsistencies.
>
>> 	qu \x00..\x7f		-	-	B
>> 	qu \x80..\xff		-	-	C
>> 	qu \x{100}...		-	-	C
>> 
>> 	chr(0x00..0x7f)		B	B	B
>> 	chr(0x80..0xff)		B	B	B
>> 	chr(0x100...)		BW	C	C
>
>I understand the case for crippling qq// and chr (and ord) for
>compatibility with legacy code.  But it seems to scream for control
>by a pragma, instead of adding a new function in place of qq// and
>offering no sane replacement for chr.  Treating Unicode code points
>0x80-0xff as second-class citizens is a real smack in the face to
>the i18n effort.  (They will probably be your most common characters
>after ASCII!)

It is because they are common that the optimization of holding
them as single bytes is used.

>
>> 	pack("C",0x00..0x7f)	B	B	B
>> 	pack("C",0x80..0xff)	B	B	B
>> 	pack("C",0x100...)	BW	BW	BW
>
>A pragma to make the last one an error would be nice.
>
>> 	pack("U",0x00..0x7f)	-	B	B
>> 	pack("U",0x80..0xff)	-	B	C
>> 	pack("U",0x100...)	-	C	C
>
>Note that since the last two are for producing byte arrays (usually
>to give to something external to perl), not strings for
>character-oriented manipulation within perl, they aren't very
>interesting here.

On the contrary - how one converts the agreed "logical" semantics
to sequences of octets to be passed to external consumers
is one area where this stuff is visible to the programmer-in-perl
and not just of interest to the programmers-of-perl.

Hence the > 5.6 change to pack('U',...) - the 5.6 form muddles
the conversion from logical characters to external octets by leaving
code points 128..255 un-encoded. The > 5.6 case corrects that by
producing a stream of pure UTF-8.

>
>> >From the table you can see certain patterns, though.
>
>They are dubious patterns.  The last two are just for creating
>binary structures, and make sense entirely independent of core
>Unicode support.  Of the remaining four, three have the ugly BBC (==
>CBC, as I mentioned) for legacy compatibility.  qu// has a BCC (==
>CCC) analog, but chr has no clean analog, and v should never have
>been sullied, because it was introduced with core Unicode support.
>
>> And the mantra "Unicode should be transparent" doesn't really help
>> here.  It simply isn't.  We can take apart (unpack("C")) an UTF-8
>> encoded character and see its guts, the bytes of UTF-8 encoding.
>
>God, I hope that's not true.  Please say you were mistaken.

It is not clear that that statement is true, or that it should be true.
But it would seem to say that pack/unpack 'C' ignores the SvUTF8 flag
and gives you the raw internals. It could just as easily be coded
to give an error if you asked for 'C' and the logical value was >= 256.

A well defined _extension_ to peer into the representation
may be useful - and we have one: Encode.xs.

The main reason people think they want it is when they need the
UTF-8 encoded sequence of octets for a code point
(e.g. LDAP, MIME's Content-Type: text/plain; charset=utf8, etc.).
They say - hey, perl has this in UTF-8 already (it may not),
so why won't it give it to me?

So you get things like:
   chop($string .= chr(256));      # append-then-chop to force the UTF-8 representation
   { use bytes; $utf8 = $string }  # grab the raw internal bytes
   print MAIL $utf8;

>
>> But that's not the ugliest part, I think I/O is the ugliest part:
>
>Yes, I/O is tricky, but is entirely its own module, and should not
>affect the model.
>
>> Yes, \x{} was supposed to be the way to "produce UTF-8", always,
>> always including the 0x80..0xff range.  But at least my gut feeling
>> is that the \x{} is not 'different enough' to warrant that different
>> semantics in that 8-bit range.  I find it too sneaky that qq \xHH
>> and qq \x{HH} would produce different strings, bytewise.  Yes, they
>> should _compare_ the same -- and they do, that's the transparency part.
>
>This difference can't be.  IOW, the user should have no direct
>control over how things are represented internally; moreover, he
>shouldn't want it.  

Those of us who know what we are doing want to influence the way
perl encodes strings internally. If we know we are going to add
code points >= 256 eventually, we may want to hint that to perl
so that it encodes 128..255 as UTF-8 as well and does not have to
re-encode the whole string later.

>Things that compare the same are (logically) the
>same.  

They are - the dispute (in so far as there is any dispute) is
over the internal representation a perl language construct
chooses. Apart from pack's U & C, this does not leak into the
"logical" world.

>Perl should be able to change the representation as it sees
>fit, at any time.  Otherwise, I fall back on my stance that Unicode
>support is too confusing to use.

>
>(This statement of yours means that I was probably mistaken in
>several places above concerning what I thought you meant.  But it
>was what you should have meant. :-) )
>
>If you don't agree, please give me an example where you would care
>about producing "different strings, bytewise".

>
>Sorry if I am critical of the current work, but with feedback, I
>hope I can help improve things.
>
>Andrew
-- 
Nick Ing-Simmons <nik@tiuk.ti.com>
Via, but not speaking for: Texas Instruments Ltd.

