
Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2)

Jarkko Hietaniemi
February 15, 2001 20:48
> > > the qu// horror, but on par with forcing h2xs to produce a
> > 
> > From the documentation of qu// in perlop, I fail to see why it's anywhere
> > near as horrible as changes to the numeric operators. I would have thought
> > that qu// was less bad, not the other way round.
> I found no documentation in perlop!  
>              Like the qq manpage but generates Unicode for
>              characters whose code points are greater than 128,
>              or 0x80.
> "Generates Unicode"?  What does it mean?  "Generates bytes"?  How do I

My bad.  It should say "generates UTF-8".

> distinguish "generated bytes" from "generated Unicode"?
> The principal idea of Unicode support in Perl is that it is
> transparent (on the Perl level).  As far as I understand the

You have misunderstood the intent, then.  qu// is NOT supposed
to be transparent, it's exactly meant to be used where someone
wants purposefully to BREAK the transparency and knowingly generate
UTF-8 even for the 0x80...0xff range.  Think of it as "qq explicitly
generating UTF-8".

> implementation, the above documentation says:
>   qu// is absolutely identical to qq// (except for some bugs which are
>   not fixed yet).

> And an operation which produces results identical to qq// should be
> named qq//.  Period.  If you want to extend/change the details of how

It's not identical.  It's different for the said area, 0x80..0xff.

Let me take you through the old Unicode model (5.6) and the current one,
and the old non-Unicode, and the various ways to produce 'characters':

				<5.6	5.6	>5.6

	qq \x00..\x7f		B	B	B
	qq \x80..\xff		B	C	B
	qq \x{100}...		-	C	C

	v0..v127		-	B	B
	v128..v255		-	B	B
	v256...			-	C	C

	qu \x00..\x7f		-	-	B
	qu \x80..\xff		-	-	C
	qu \x{100}...		-	-	C

	chr(0x00..0x7f)		B	B	B
	chr(0x80..0xff)		B	B	B
	chr(0x100...)		BW	C	C

	pack("C",0x00..0x7f)	B	B	B
	pack("C",0x80..0xff)	B	B	B
	pack("C",0x100...)	BW	BW	BW

	pack("U",0x00..0x7f)	-	B	B
	pack("U",0x80..0xff)	-	B	C
	pack("U",0x100...)	-	C	C

	B means a byte, BW a modulo 256-wrapped byte,
	C means an UTF-8 encoded multibyte character.

(I hope I got the table right, and I hope I didn't miss any cases; I'm
kind of tired.  I also purposefully left out 5.6 + use utf8 since that
was found out to be a bad mistake: lexical scope and the utf8 pragma
should play no role in what kind of data a scalar gets.)
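
To check the run-time rows of the table on whatever perl you have at
hand, something like the following rough sketch may help.  It assumes
perl 5.8.1 or later on an ASCII platform, and it uses utf8::is_utf8()
(the internal flag) as a stand-in for the table's B versus C; the
storage() helper is a made-up name for illustration only.  It leaves
out pack("U") below 0x80 and the wrapping pack("C") case.

```perl
use strict;
use warnings;

# Hypothetical helper: "B" for a plain byte string, "C" for a string
# stored as UTF-8 (the table's notation), judged by the internal flag.
sub storage {
    my ($str) = @_;
    return utf8::is_utf8($str) ? "C" : "B";
}

my %kind = (
    'chr(0x41)'       => storage(chr(0x41)),
    'chr(0xE1)'       => storage(chr(0xE1)),
    'chr(0x100)'      => storage(chr(0x100)),
    'pack("C",0xE1)'  => storage(pack("C", 0xE1)),
    'pack("U",0xE1)'  => storage(pack("U", 0xE1)),
    'pack("U",0x100)' => storage(pack("U", 0x100)),
);

printf "%-16s %s\n", $_, $kind{$_} for sort keys %kind;
```

On such a perl the chr() and pack("C") calls below 0x100 should report
B and the rest C, which matches the ">5.6" column for those rows.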

From the table you can see certain patterns, though.  First three
compile time ways, then three run-time ways.  Another pattern is to
notice the BBC in 5.6 (well, except in pack("C"), which wraps)
everywhere -- except qq!  We have five ways but one of them is
different, it produces UTF-8 earlier than the other methods.  Ooops.
Messy.  That's basically why qu: to keep qq consistent with the other
character-producing methods, and a compile-time (as opposed to pack("U"))
counterpart to pack("U").

And the mantra "Unicode should be transparent" doesn't really help
here.  It simply isn't.  We can take apart (unpack("C")) an UTF-8
encoded character and see its guts, the bytes of UTF-8 encoding.
But that's not the ugliest part, I think I/O is the ugliest part:
there one cannot escape the fact that one has to think about encoding
and character sets and all that nastily non-transparent stuff.
We still haven't even solved I/O, really!  We do have a good start
on it, though, with Encode and the new perlio, but we can't claim
to have anything even close to a 'transparent' Unicode.  Not if we want to
stay compatible with the 8-bit past, and the still existing vast 8-bit
outside world.
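
To make the "guts" point concrete, here is a small sketch, assuming a
perl that ships the Encode module and an ASCII platform.  It only
illustrates that one character turns into different octets depending
on the encoding picked at the boundary; it is not a statement about
how the I/O layers are or will be implemented.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = chr(0xE1);    # U+00E1, LATIN SMALL LETTER A WITH ACUTE

# The same character becomes different octets depending on the encoding.
my $utf8   = encode("UTF-8", $char);
my $latin1 = encode("ISO-8859-1", $char);

my @utf8_octets   = unpack("C*", $utf8);      # two octets: 0xC3 0xA1
my @latin1_octets = unpack("C*", $latin1);    # one octet:  0xE1

printf "UTF-8:   %s\n", join " ", map { sprintf "%02X", $_ } @utf8_octets;
printf "Latin-1: %s\n", join " ", map { sprintf "%02X", $_ } @latin1_octets;
```

Whoever writes the string to a file or a socket has to pick one of
these, and that choice is exactly the non-transparent part.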

Yes, \x{} was supposed to be the way to "produce UTF-8", always,
always including the 0x80..0xff range.  But at least my gut feeling
is that the \x{} is not 'different enough' to warrant that different
semantics in that 8-bit range.  I find it too sneaky that qq \xHH
and qq \x{HH} would produce different strings, bytewise.  Yes, they
should _compare_ the same -- and they do, that's the transparency part.
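
A sketch of that "same string, different bytes" situation, assuming
perl 5.8.1 or later.  It does not use qu at all; utf8::upgrade() is
used here only as a stand-in for forcing the UTF-8 representation of
a character in the 0x80..0xff range.

```perl
use strict;
use warnings;

my $bytes = "\xE1";      # stored as the single byte 0xE1
my $wide  = "\xE1";
utf8::upgrade($wide);    # same character, now stored as UTF-8 internally

my $same_string   = ($bytes eq $wide)       ? 1 : 0;
my $bytes_is_utf8 = utf8::is_utf8($bytes)   ? 1 : 0;
my $wide_is_utf8  = utf8::is_utf8($wide)    ? 1 : 0;

print "eq: $same_string, flags: $bytes_is_utf8 vs $wide_is_utf8\n";
# prints "eq: 1, flags: 0 vs 1"
```

The two scalars compare equal at the Perl level even though their
internal storage differs.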

(An additional brain twister: if you don't care about EBCDIC you can
choose not to care about this, either.  In EBCDIC, with the 5.6 scheme
qq(\xC1) would be 'A', while qq(\x{C1}) would be 'Á'... the first one
the native EBCDIC CAPITAL A, the second one the Unicode LATIN CAPITAL
LETTER A WITH ACUTE.  This because the EBCDIC character set is rather
different from ASCII or ISO Latin 1, and doing Unicode on the
0x00..0xff range kinda assumes ISO Latin 1.  But with the new scheme,
they are the same, since qq \xHH is equal to qq \x{HH}.  Backward
compatibility, good.  If somebody *explicitly* wants \x{HH} to
generate UTF-8, let him use qu.)

> qq() operates (especially details invisible to the rest of Perl), there
> is already a mechanism for this: overloaded constants.
>   use unicode ':literals';
>   my $var = qq/bar/;
> would produce the Unicode-marked string "bar".  Here the

Madness.  (Sorry, couldn't resist :-)  This makes no sense,
at least if you literally meant to use the literal string "bar".
In our current Unicode model an SV (the PV) shall be marked Unicode
*ONLY* if it contains UTF-8 encoded characters.

> qq-overloading subroutine may look as simple as
>   {
>     my $empty = substr "\x{101}", 1;
>     sub ($) {shift . $empty}

I'm sorry but this looks just like the kind of horrible hack we want
to stay away from.  To mark a string Unicode you create by magical
means an empty string that has the "Unicode flag" on and concatenate
with that empty string to propagate that "Unicode flag"?  Yechhhh.

>   }
> Or call this subroutine qu, and use it as
>   my $var = u"bar";
> Ilya

$jhi++; #
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
