
Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2)

Jarkko Hietaniemi
February 16, 2001 06:44
On Fri, Feb 16, 2001 at 01:18:37AM -0500, Ilya Zakharevich wrote:
> On Thu, Feb 15, 2001 at 10:48:33PM -0600, Jarkko Hietaniemi wrote:
> > >              Like the qq manpage but generates Unicode for
> > >              characters whose code points are greater than 128,
> > >              or 0x80.
> > > 
> > > "Generates Unicode"?  What does it mean?  "Generates bytes"?  How do I
> > 
> > My bad.  It should say "generates UTF-8".
> I still have no idea what this may mean.

The number 0x80 mapping into bytes 0xc2 0x80
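
A minimal sketch of that mapping, assuming a perl that ships the Encode module (the variable names are only illustrative). It encodes code point 0x80 as UTF-8 and lists the resulting octets:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Code point 0x80 as a one-character string.
my $char = chr(0x80);

# Encode to UTF-8 octets and list them in hex.
my $octets = encode('UTF-8', $char);
my $hex    = join ' ', map { sprintf '0x%02x', $_ } unpack('C*', $octets);

print "$hex\n";    # prints "0xc2 0x80"
```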

> Given that you cannot distinguish byte-encoded strings from
> utf8-encoded strings from Perl code, I fail to see any difference

Yes, you can.  unpack("C*", ...), unless, of course, you intend to
change that, too.  { use bytes; length() }.  unpack("U*") will croak
if you feed it malformed UTF-8.
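
A small sketch of the "use bytes" probe, with two assumptions: it runs on a present-day perl, and it uses utf8::upgrade() and utf8::is_utf8() (not mentioned above) to build and inspect the two representations. The unpack("C*") behaviour is left out on purpose, since it may differ between perl versions.

```perl
use strict;
use warnings;

# The same single character (code point 0xE9) in two internal representations.
my $byte_str = chr(0xE9);        # stored as one byte
my $utf8_str = chr(0xE9);
utf8::upgrade($utf8_str);        # stored internally as UTF-8 (two bytes)

# At the character level the two strings are indistinguishable.
my $same     = $byte_str eq $utf8_str ? 1 : 0;
my $char_len = length($utf8_str);

# Under "use bytes", length() counts the internal bytes instead.
my ($blen_byte, $blen_utf8);
{
    use bytes;
    $blen_byte = length($byte_str);
    $blen_utf8 = length($utf8_str);
}

# The internal flag can also be queried directly.
my $flag_byte = utf8::is_utf8($byte_str) ? 1 : 0;
my $flag_utf8 = utf8::is_utf8($utf8_str) ? 1 : 0;

print "eq: $same, chars: $char_len, bytes: $blen_byte vs $blen_utf8, ",
      "flags: $flag_byte vs $flag_utf8\n";
```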

> > Yes, \x{} was supposed to be the way to "produce UTF-8", always,
> > always including the 0x80..0xff range.
> Given transparency, you do not need such a thing.

Wrong.  There are standards and protocols, and other pieces of
software, out there that *require* producing UTF-8.  IIRC LDAP
is one of those outside bits.  Java would be another [*].
Perl must be able to interface with the outside world.

[*] Though don't get me started on how Java's readUTF() and writeUTF()
do not do real UTF-8 as defined by the RFC :-)

> ord('A') should be the same on all the systems, unless use locale or

It isn't.
In EBCDIC that produces 0xC1, or 193.
It might be nice if it did.
Changing it would break existing code.
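
For readers without an EBCDIC box at hand, a rough sketch that emulates the 0xC1 result on an ASCII platform. The assumptions are that Encode is available and that its "cp37" table stands in for the EBCDIC code page in question:

```perl
use strict;
use warnings;
use Encode qw(encode);

# What ord('A') gives on the platform running this script.
my $native = ord('A');

# The octet that 'A' maps to in the EBCDIC code page cp37.
my $ebcdic = ord(encode('cp37', 'A'));

printf "native ord('A') = %d, cp37 'A' = 0x%02X (%d)\n",
    $native, $ebcdic, $ebcdic;
```

On an ASCII-based platform the first number is 65, while the second is what a script on a cp37 EBCDIC system would see from ord('A').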

> somesuch is in effect.  EBCDIC ports may behave as if they have an

This means that you want to impose ISO Latin 1 on everyone in the 8-bit world.

> implicit 'use locale' around each script.

'use locale' has *nothing* to do with this.

> [locales are just ways to assign a different cultural information to
>  integers (=characters).  As Larry said, Perl should allow one use

I wish they were -- but they are not.  That's not how they have been
(very weakly) defined by standards and (badly) implemented by vendors.
For one thing, they have very little to do with character encodings.

>  big5 for an internal cultural-info tables instead of unicode.
>  Similarly, 'use locale' just loads a different table into the range
>  0..255.  {BTW, It may make sense to make the "Unicode 0..255 range"

Sorry, Ilya, that's completely not what happens.  If you suggest
changing that, you suggest changing the semantics of 'use locale'
so completely that you are not talking Perl 5 anymore.

>  available at some utf8-offset outside the UTF-8 range.  Say, at
>  80000000..800000FF.  Then these chars may be useful even with 'use
>  locale' present.} ]
> Thus this (locales and EBCDIC) has nothing to do with the
> transparency.
> > >   my $var = qq/bar/;
> > > 
> > > would produce the Unicode-marked string "bar".  Here the
> > 
> > Madness.  (Sorry, couldn't resist :-)  This makes no sense,
> > at least if you literally meant to use the literal string "bar".
> > In our current Unicode model a SV (the PV) shall be marked Unicode
> > *ONLY* if it contains UTF-8 encoded characters.
> A string *must* be marked utf8 if it was utf8-encoded and contained chars
> above 127.  A string *may* be marked utf8 if it is byte-encoded, but does
> not contain chars above 127.

Your sentence is in opposition to our existing Unicode model and
implementation, which seems to be working rather nicely, so you must
have a complete alternative implementation in your back pocket.

Your sentence is essentially saying that utf8-marking is a hint (that
might be false) that the string might contain chars above 127,
instead of the current implementation where it is a guarantee of that.
Unsurprisingly, I find the current model much cleaner.
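
To make the case under dispute concrete, a sketch of an ASCII-only string with and without the mark. It assumes a present-day perl and uses utf8::upgrade() and utf8::is_utf8(), which are not part of this thread; it only shows how the situation can be observed, not how 5.7 behaves.

```perl
use strict;
use warnings;

# Two copies of the same ASCII-only string.
my $plain  = "bar";
my $marked = "bar";
utf8::upgrade($marked);    # set the internal utf8 mark; the bytes stay the same

my $flag_plain  = utf8::is_utf8($plain)  ? 1 : 0;
my $flag_marked = utf8::is_utf8($marked) ? 1 : 0;
my $equal       = $plain eq $marked ? 1 : 0;

# Below 128 the UTF-8 encoding is identical to the byte encoding,
# so the internal byte lengths agree as well.
my ($len_plain, $len_marked);
{
    use bytes;
    $len_plain  = length($plain);
    $len_marked = length($marked);
}

print "flags: $flag_plain vs $flag_marked, equal: $equal, ",
      "bytes: $len_plain vs $len_marked\n";
```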

> Ilya

$jhi++; #
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen