perl.perl5.porters | Postings from February 2001

Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2)

Ilya Zakharevich
February 15, 2001 22:18
On Thu, Feb 15, 2001 at 10:48:33PM -0600, Jarkko Hietaniemi wrote:
> >              Like the qq manpage but generates Unicode for
> >              characters whose code points are greater than 128,
> >              or 0x80.
> > 
> > "Generates Unicode"?  What does it mean?  "Generates bytes"?  How do I
> My bad.  It should say "generates UTF-8".

I still have no idea what this may mean.

> > distinguish "generated bytes" from "generated Unicode"?
> > 
> > The principal idea of Unicode support in Perl is that it is
> > transparent (on the Perl level).  As far as I understand the
> You have misunderstood the intent, then.  qu// is NOT supposed
> to be transparent, it's exactly meant to be used where someone
> wants purposefully to BREAK the transparency and knowingly generate
> UTF-8 even for the 0x80...0xff range.  Think of it as "qq explicitly
> generating UTF-8".

[See above, but assume one randomly picked possible meaning]

Given that you cannot distinguish byte-encoded strings from
utf8-encoded strings from Perl code, I fail to see any difference
between qu// and qq//.
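
A minimal sketch of the point about indistinguishable representations, assuming a later Perl (5.8.1 or newer) where utf8::upgrade and utf8::is_utf8 are available; neither existed in the 5.7.0 under discussion, and the variable names are illustrative only. Ordinary string operations treat the two copies alike; only the explicit flag query shows which internal encoding is in use.

```perl
use strict;
use warnings;

# Two copies of the same four characters; \xE9 is above 127.
my $bytes = "caf\xE9";   # internal storage: one byte per character
my $wide  = "caf\xE9";
utf8::upgrade($wide);    # same characters, internal storage now UTF-8

# Ordinary string operations see no difference between the two.
my $same_string = ($bytes eq $wide)                                 ? 1 : 0;
my $same_length = (length($bytes) == length($wide))                 ? 1 : 0;
my $same_ord    = (ord(substr($bytes, 3)) == ord(substr($wide, 3))) ? 1 : 0;

# Only an explicit look at the internal flag tells them apart.
my @flags = map { utf8::is_utf8($_) ? 1 : 0 } ($bytes, $wide);

print "eq=$same_string length=$same_length ord=$same_ord flags=@flags\n";
# prints: eq=1 length=1 ord=1 flags=0 1
```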

> And the mantra "Unicode should be transparent" doesn't really help
> here.  It simply isn't. 

Then this should be fixed.

> But that's not the ugliest part, I think I/O is the ugliest part:
> there one cannot escape the fact that one has to think about encoding
> and character sets and all that nastily non-transparent stuff.

Yes.  And it should be *the only* area where these issues matter.

> We still haven't even solved I/O, really!  We do have a good start
> on it, though, with Encode and the new perlio, but we can't claim
> to have anything even close to a 'transparent' Unicode.  Not if we want to
> stay compatible with the 8-bit past, and the still existing vast 8-bit
> outside world.

I do not see why compatibility bothers you.  As long as the representation
is transparent, you get compatibility automatically (except for XS
code, which needs to care about SvPVutf8 and SvPVbyte).

> Yes, \x{} was supposed to be the way to "produce UTF-8", always,
> always including the 0x80..0xff range.

Given transparency, you do not need such a thing.

> (An additional brain twister: if you don't care about EBCDIC you can
> choose not to care about this, either.  In EBCDIC, with the 5.6 scheme
> qq(\xC1) would be 'A', while qq(\x{C1}) would be 'Á'... the first one
> the native EBCDIC CAPITAL A, the second one the Unicode LATIN CAPITAL
> LETTER A WITH ACUTE.  This because the EBCDIC character set is rather
> different from ASCII or ISO Latin 1, and doing Unicode on the
> 0x00..0xff range kinda assumes ISO Latin 1.  But with the new scheme,
> they are the same, since qq \xHH is equal to qq \x{HH}.  Backward
> compatibility, good.  If somebody *explicitly* wants \x{HH} to
> generate UTF-8, let him use qu.)

ord('A') should be the same on all systems, unless use locale or
somesuch is in effect.  EBCDIC ports may behave as if they have an
implicit 'use locale' around each script.

[locales are just ways to assign different cultural information to
 integers (=characters).  As Larry said, Perl should allow one to use
 big5 for the internal cultural-info tables instead of unicode.
 Similarly, 'use locale' just loads a different table into the range
 0..255.  {BTW, it may make sense to make the "Unicode 0..255 range"
 available at some utf8-offset outside the UTF-8 range.  Say, at
 80000000..800000FF.  Then these chars may be useful even with 'use
 locale' present.} ]

Thus this (locales and EBCDIC) has nothing to do with the

> >   my $var = qq/bar/;
> > 
> > would produce the Unicode-marked string "bar".  Here the
> Madness.  (Sorry, couldn't resist :-)  This makes no sense,
> at least if you literally meant to use the literal string "bar".
> In our current Unicode model an SV (the PV) shall be marked Unicode
> *ONLY* if it contains UTF-8 encoded characters.

A string *must* be marked utf8 if it is utf8-encoded and contains chars
above 127.  A string *may* be marked utf8 if it is byte-encoded but does
not contain chars above 127.
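
A rough sketch of why the mark is optional below 128 and mandatory above it, assuming a later Perl where utf8::encode is available (it was not in 5.7.0). The helper utf8_byte_count is hypothetical, introduced only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: how many bytes the UTF-8 encoding of $str occupies.
sub utf8_byte_count {
    my ($str) = @_;
    my $copy = $str;
    utf8::encode($copy);   # replaces the characters with their UTF-8 bytes
    return length $copy;
}

my $ascii  = "bar";        # no characters above 127
my $latin1 = "b\xE1r";     # \xE1 is above 127

# ASCII only: the byte form and the utf8 form are the same three bytes,
# so marking the string utf8 changes nothing.
my $ascii_bytes = utf8_byte_count($ascii);

# Above 127: the utf8 form needs two bytes for \xE1, so the mark is the
# only thing that says how the stored bytes are to be read.
my $latin1_bytes = utf8_byte_count($latin1);

print "ascii:  ", length($ascii),  " chars, $ascii_bytes bytes as utf8\n";
print "latin1: ", length($latin1), " chars, $latin1_bytes bytes as utf8\n";
# prints:
# ascii:  3 chars, 3 bytes as utf8
# latin1: 3 chars, 4 bytes as utf8
```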

Ilya