
Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2)

Ilya Zakharevich
February 16, 2001 12:53
On Fri, Feb 16, 2001 at 08:44:26AM -0600, Jarkko Hietaniemi wrote:
> > Given that you cannot distinguish byte-encoded strings from
> > utf8-encoded strings from Perl code, I fail to see any difference
> Yes, you can.  unpack("C*", ...), unless, of course, you intend to
> change that, too.

There should be no operations in the core which expose the internal
representation.  Only modules - like Devel::Peek.  The &|^ disaster
should have taught us this.

> { use bytes; length() }.  unpack("U*") will croak
> if you feed it malformed UTF-8.

There should be no 'use bytes'.

> > Given transparency, you do not need such a thing.
> Wrong.  There are standards and protocols, and other pieces of
> software, out there that *require* producing UTF-8.  IIRC LDAP
> is one of those outside bits.  Java would be another [*].
> Perl must be able to interface with the outside world.

Of course, but here we discuss the internal operations, not the I/O.
Each I/O channel (including system-calls) needs to be marked by the
translation used.

> > ord('A') should be the same on all the systems, unless use locale or
> It isn't.
> In EBCDIC that produces 0xC1, or 193.

As I said, in EBCDIC you have an implicit "use locale" around your script.

> > somesuch is in effect.  EBCDIC ports may behave as if they have an
> This means that you want to impose ISO Latin 1 on everyone in 8-bit world.

Unless 'use locale' is in effect.  This is exactly what we have now.

> > implicit 'use locale' around each script.
> 'use locale' has *nothing* to do with this.

You err.

> > [locales are just ways to assign different cultural information to
> >  integers (=characters).  As Larry said, Perl should allow one to use
> I wish they were -- but they are not.  That's not how they have been
> (very weakly) defined by standards and (badly) implemented by vendors.
> For one thing, they have very little to do with character encodings.

Here I discuss "locales as seen from Perl", not something else.

> >  big5 for the internal cultural-info tables instead of unicode.
> >  Similarly, 'use locale' just loads a different table into the range
> >  0..255.  {BTW, It may make sense to make the "Unicode 0..255 range"
> Sorry, Ilya, that's completely not what happens.

How so?  (Unless you consider collation - which is not "completely
not" either.)

> > A string *must* be marked utf8 if it was utf8-encoded and contained chars
> > above 127.  A string *may* be marked utf8 if it is byte-encoded, but does
> > not contain chars above 127.
> Your sentence is in opposition with our existing Unicode model and
> implementation, which seems to be working rather nicely, so you must
> have a complete alternative implementation in your backpocket.

Please explain how having a string marked as utf8 and with PVX="a"
"opposes" your model.

> Your sentence is essentially saying that utf8-marking is a hint (that
> might be false) that the string might contain chars above 127,
> instead of the current implementation where it is a guarantee of that.

> Unsurprisingly, I find the current model much cleaner.

Unsurprisingly, I do not.  You need an extra scan on each string
operation to (sometimes) switch off the utf8 bit.  Switching it off gives
no visible semantic changes, and is quite time-consuming.
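
To make the cost concrete, here is a minimal sketch in C of the kind of
scan meant above.  It is an illustration only, not code from the perl
core: the struct and the function name are hypothetical, and it assumes
the "guarantee" reading of the flag (flag on implies at least one char
above 127).

```c
#include <stddef.h>

/* Hypothetical stand-in for a string value; not Perl's real SV layout. */
struct str {
    const unsigned char *pv;   /* the buffer (what the post calls PVX) */
    size_t len;                /* number of bytes in the buffer */
    int utf8;                  /* 1 if the buffer is marked utf8 */
};

/* Keep the flag a "guarantee": clear it when no byte above 127 remains.
 * The loop is the extra O(len) pass paid after a string operation. */
static void normalize_utf8_flag(struct str *s)
{
    size_t i;
    if (!s->utf8)
        return;                 /* nothing to switch off */
    for (i = 0; i < s->len; i++)
        if (s->pv[i] & 0x80)
            return;             /* a high byte: the flag must stay on */
    s->utf8 = 0;                /* pure 0..127: same chars either way */
}
```

For a buffer holding only chars 0..127 the flag is cleared, yet the chars
the string denotes are the same before and after, which is the "no
visible semantic changes" point.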

It may also significantly slow down (or significantly speed up) the
following operations over this SV - but I would prefer to consider
semantic changes separately from performance issues (especially for
such unclear performance corollaries).

The "correct" model would use two bits for encoding: "PVX contains a
sequence of byte-encoded chars", "PVX contains a sequence of
utf8-encoded chars".  The strings with only chars 0..127 (in
the "canonical" representation) would be marked as both.
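
A minimal sketch in C of the two-bit marking described above.  The flag
names and helper functions are hypothetical (they are not existing perl
flags), and the sketch assumes the caller already knows which encoding
the buffer was produced in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag names, one bit per valid reading of the buffer. */
enum {
    ENC_BYTES = 1 << 0,   /* buffer is a sequence of byte-encoded chars */
    ENC_UTF8  = 1 << 1    /* buffer is a sequence of utf8-encoded chars */
};

/* Return 1 if every byte is in 0..127, else 0. */
static int all_ascii(const unsigned char *p, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        if (p[i] > 127)
            return 0;
    return 1;
}

/* Compute the marking for a buffer whose encoding is `known`
 * (exactly one of ENC_BYTES or ENC_UTF8).  A buffer with only chars
 * 0..127 is the same under both encodings, so it gets both bits. */
static unsigned classify(const unsigned char *p, size_t len, unsigned known)
{
    assert(known == ENC_BYTES || known == ENC_UTF8);
    return all_ascii(p, len) ? (unsigned)(ENC_BYTES | ENC_UTF8) : known;
}
```

With this marking, a consumer that needs bytes tests ENC_BYTES and one
that needs utf8 tests ENC_UTF8; conversion is required only when the bit
it wants is absent.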

In the current model the flag is *used* to distinguish things which
need some massage when converting to byte-strings and utf8-strings.
It does not make a lot of sense to have the "informal meaning" of the
flag so distant from "the meaning of the flag when used".

Ilya