perl.perl5.porters | Postings from February 2001

Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2)

Jarkko Hietaniemi
February 15, 2001 21:42
Argh.  Let's try that table again.  I got the \xHH and \x{HH} muddled.

 				<5.6	5.6	>5.6
 	qq \x00..\x7f		B	B	B
 	qq \x80..\xff		B	B	B
 	qq \x{00}..\x{7f}	-	B	B
 	qq \x{80}..\x{ff}	-	C	B
 	qq \x{100}...		-	C	C
 	v0..v127		-	B	B
 	v128..v255		-	B	B
 	v256...			-	C	C
 	qu \x00..\x7f		-	-	B
 	qu \x80..\xff		-	-	C
 	qu \x{00}..\x{7f}	-	-	B
 	qu \x{80}..\x{ff}	-	-	C
 	qu \x{100}...		-	-	C
 	chr(0x00..0x7f)		B	B	B
 	chr(0x80..0xff)		B	B	B
 	chr(0x100...)		BW	C	C
 	pack("C",0x00..0x7f)	B	B	B
 	pack("C",0x80..0xff)	B	B	B
 	pack("C",0x100...)	BW	BW	BW
 	pack("U",0x00..0x7f)	-	B	B
 	pack("U",0x80..0xff)	-	B	C
 	pack("U",0x100...)	-	C	C
 	B means a byte, BW a modulo 256-wrapped byte,
 	C means a UTF-8 encoded multibyte character,
 	- means the construct is not available in that version.
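
As a rough illustration of the legend above, here is a small Python model of the three outcomes. It is only a sketch: the function names are made up for this example, and it says nothing about how perl itself stores strings.

```python
# Hypothetical helpers modelling the table legend (B, BW, C).
# They are not Perl and not perl's internals.

def as_byte(value):
    """B: the value stored as a single byte (only valid for 0x00..0xff)."""
    return bytes([value])

def as_wrapped_byte(value):
    """BW: the value reduced modulo 256, then stored as a single byte."""
    return bytes([value % 256])

def as_utf8(value):
    """C: the code point encoded as UTF-8."""
    return chr(value).encode("utf-8")

print(as_byte(0x41))           # b'A'
print(as_wrapped_byte(0x141))  # b'A'  (0x141 % 256 == 0x41)
print(as_utf8(0xe9))           # b'\xc3\xa9'
print(as_utf8(0x100))          # b'\xc4\x80'
```

For values in 0x00..0x7f the UTF-8 encoding is the same single byte, which is consistent with the table showing B for that range in every column.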

Because the table in my first try was a bit off, some of my explanations
were also a bit off: I spoke about the combination of qu plus \x{HH} with
HH in the 0x80..0xff range, when I should have been talking just about
qu plus \x with HH in said range: in qu, 0x80..0xff generates UTF-8,
as opposed to qq, where it produces bytes.

Another way to look at it is that whether the \x has the curly braces
or not has no (direct) relevance to the UTF-8-ness of the result:
it's only that with the braces one can supply more than two hex digits,
which implicitly causes the result to be encoded in UTF-8.  So while
it's not the braces, it is the codevalue: codepoint > 0xff, ta-dah,
UTF-8 encoding -- unless someone wants UTF-8 already earlier, in which
case they should pull out qu from their toolbox.
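
The rule in the last two paragraphs can be sketched as a small Python model of the ">5.6" column. This is an illustration under assumptions, not Perl: `encode_literal` and its `quote` parameter are hypothetical names, and the function takes only the code value because the braces themselves make no difference.

```python
# Hypothetical model of the ">5.6" column for \x escapes in qq and qu.
# The only inputs are the code value and the quoting operator;
# whether the escape was written \xHH or \x{HH} is deliberately ignored.

def encode_literal(codepoint, quote="qq"):
    # qq switches to UTF-8 above 0xff; qu already does so above 0x7f.
    threshold = 0x7f if quote == "qu" else 0xff
    if codepoint > threshold:
        return chr(codepoint).encode("utf-8")
    return bytes([codepoint])

print(encode_literal(0x41, "qq"))   # b'A'
print(encode_literal(0x41, "qu"))   # b'A'
print(encode_literal(0xe9, "qq"))   # b'\xe9'
print(encode_literal(0xe9, "qu"))   # b'\xc3\xa9'
print(encode_literal(0x100, "qq"))  # b'\xc4\x80'
print(encode_literal(0x100, "qu"))  # b'\xc4\x80'
```

The two operators differ only in the 0x80..0xff range; everywhere else they agree.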

To recap what happened from the viewpoint of our developing Unicode
external model *and* our internal implementation: since the \x{HH} within
qq behaved differently than all the rest, it was brought back into line.
But that meant that we lost the way to knowingly generate UTF-8 at
compile time, which some of us seemed to need, or at least want.  Ergo,
qu was born.

Summa summarum: if you want to be transparent, use qq and never even
think about qu.  If you don't want to be transparent and instead want
to knowingly generate UTF-8, use qu.

$jhi++; #
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen