Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2)

From: Ilya Zakharevich
Date: February 16, 2001 12:53
Subject: Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2)
Message ID: 20010216155335.D20979@math.ohio-state.edu

On Fri, Feb 16, 2001 at 08:44:26AM -0600, Jarkko Hietaniemi wrote:
> > Given that you cannot distinguish byte-encoded strings from
> > utf8-encoded strings from Perl code, I fail to see any difference
> 
> Yes, you can.  unpack("C*", ...), unless, of course you intend to
> change that, too.

There should be no operations in the core which expose the internal
representation.  Only modules - like Devel::Peek.  The &|^ disaster
should have taught us this.

> { use bytes; length() }.  unpack("U*") will croak
> if you feed it malformed UTF-8.

There should be no 'use bytes'.

> > Given transparency, you do not need such a thing.
> 
> Wrong.  There are standards and protocols, and other pieces of
> software, out there that *require* producing UTF-8.  IIRC LDAP
> is one of those outside bits.  Java would be another [*].
> Perl must be able to interface with the outside world.

Of course, but here we discuss the internal operations, not the I/O.
Each I/O channel (including system-calls) needs to be marked by the
translation used.
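
A minimal sketch of the "marked channel" idea, written in Python rather than Perl and using only the standard `io` module. The helper name `write_through` and the sample string are assumptions made for illustration. The internal string stays the same; only the translation attached to the channel decides which bytes leave the program.

```python
import io

# The same internal string, written through two differently marked channels.
text = "na\u00efve"  # contains U+00EF, a char above 127


def write_through(encoding: str) -> bytes:
    """Write `text` to an in-memory channel marked with `encoding`."""
    raw = io.BytesIO()
    channel = io.TextIOWrapper(raw, encoding=encoding, newline="")
    channel.write(text)
    channel.flush()
    return raw.getvalue()


utf8_bytes = write_through("utf-8")      # two bytes for U+00EF
latin1_bytes = write_through("latin-1")  # one byte for U+00EF
```

Code that builds `text` never needs to know which representation is used on the wire, which is the separation between internal operations and I/O argued for above.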

> > ord('A') should be the same on all the systems, unless use locale or
> 
> It isn't.
> In EBCDIC that produces 0xC1, or 193.

As I said, in EBCDIC you have an implicit "use locale" around your script.

> > somesuch is in effect.  EBCDIC ports may behave as if they have an
> 
> This means that you want to impose ISO Latin 1 on everyone in 8-bit world.

Unless 'use locale' is in effect.  This is exactly what we have now.

> > implicit 'use locale' around each script.
> 
> 'use locale' has *nothing* to do with this.

You err.

> > [locales are just ways to assign a different cultural information to
> >  integers (=characters).  As Larry said, Perl should allow one use
> 
> I wish they were -- but they are not.  That's not how they have been
> (very weakly) defined by standards and (badly) implemented by vendors.
> For one thing, they have very little to do with character encodings.

Here I discuss "locales as seen from Perl", not something else.

> >  big5 for an internal cultural-info tables instead of unicode.
> >  Similarly, 'use locale' just loads a different table into the range
> >  0..255.  {BTW, It may make sense to make the "Unicode 0..255 range"
> 
> Sorry, Ilya, that's completely not what happens.

How so?  (Unless you consider collation - which is not "completely
not" either.)

> > A string *must* be marked utf8 if it was utf8-encoded and contained
> > chars above 127.  A string *may* be marked utf8 if it is byte-encoded,
> > but does not contain chars above 127.
> 
> Your sentence is in opposition with our existing Unicode model and
> implementation, which seems to be working rather nicely, so you must
> have a complete alternative implementation in your backpocket.

Please explain how having a string marked as utf8 and with PVX="a"
"opposes" your model.

> Your sentence is essentially saying that utf8-marking is a hint (that
> might be false) that the string might contain chars above 127,
> instead of the current implementation where it is a guarantee of that.

> Unsurprisingly, I find the current model much cleaner.

Unsurprisingly, I do not.  You need an extra scan on each string
operation to (sometimes) switch off the utf8 bit.  Switching it off
gives no visible semantic change, and is quite time-consuming.

It may also significantly slow down (or significantly speed up) the
following operations over this SV - but I would prefer to consider
semantic changes separately from performance issues (especially for
such unclear performance corollaries).
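
A rough Python model of the extra scan, not the Perl implementation. The function names `substr_strict`, `substr_lazy` and `decode` are hypothetical, buffers are plain `bytes`, and the slice offsets are assumed to fall on character boundaries. Under the "flag is a guarantee" reading, every result has to be rescanned so the flag can be cleared; under the "flag may stay on" reading it is simply copied.

```python
def substr_strict(buf: bytes, utf8: bool, start: int, end: int):
    """Slice, then rescan so the flag is set only if a char above 127 remains."""
    piece = buf[start:end]
    # Extra O(len(piece)) pass, needed only to keep the guarantee true.
    if utf8 and all(b < 128 for b in piece):
        utf8 = False
    return piece, utf8


def substr_lazy(buf: bytes, utf8: bool, start: int, end: int):
    """Slice and keep the flag as it was: no extra pass over the result."""
    return buf[start:end], utf8


def decode(piece: bytes, utf8: bool) -> str:
    """Interpret the buffer according to its flag (latin-1 models byte-encoded)."""
    return piece.decode("utf-8" if utf8 else "latin-1")


source = "a\u00e9".encode("utf-8")  # b"a\xc3\xa9", marked utf8
strict = substr_strict(source, True, 0, 1)
lazy = substr_lazy(source, True, 0, 1)
```

Both results decode to the same one-character string; only the flag differs, so the scan buys no visible semantic change in this model.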

The "correct" model would use two bits for encoding: "PVX contains a
sequence of byte-encoded chars", "PVX contains a sequence of
utf8-encoded chars".  The strings with only chars 0..127 (in
the "canonical" representation) would be marked as both.
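
The two-bit idea can be sketched as follows, in Python and under stated assumptions: the names `Encoding` and `classify` are invented for this illustration, and the caller is assumed to know how a buffer with bytes above 127 was produced. A buffer holding only chars 0..127 reads the same under both interpretations, so it gets both bits.

```python
from enum import Flag, auto


class Encoding(Flag):
    BYTES = auto()  # buffer is a sequence of byte-encoded chars
    UTF8 = auto()   # buffer is a sequence of utf8-encoded chars


def classify(buf: bytes, is_utf8: bool) -> Encoding:
    """Return the encoding bits for `buf`; `is_utf8` says how it was encoded."""
    if all(b < 128 for b in buf):
        # Only chars 0..127: identical bytes either way, so mark both.
        return Encoding.BYTES | Encoding.UTF8
    return Encoding.UTF8 if is_utf8 else Encoding.BYTES


ascii_flags = classify(b"a", is_utf8=False)
latin1_flags = classify("\u00e9".encode("latin-1"), is_utf8=False)
utf8_flags = classify("\u00e9".encode("utf-8"), is_utf8=True)
```

With both bits available, an operation that needs a byte string or a utf8 string only has to convert when the bit it wants is missing.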

In the current model the flag is *used* to distinguish things which
need some massage when converting to byte-strings and utf8-strings.
It does not make a lot of sense to have the "informal meaning" of the
flag so distant from "the meaning of the flag when used".

Ilya


