Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2)

From: Jarkko Hietaniemi
Date: February 16, 2001 14:17
Subject: Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2)
Message ID: 20010216161711.G9171@chaos.wustl.edu

> > Yes, you can.  unpack("C*", ...), unless, of course you intend to
> > change that, too.
> 
> There should be no operations in the core which expose the internal
> representation.  Only modules - like Devel::Peek.

Ilya, I'm starting to think that we are so far from agreement in our
Unicode and locale issues that we are generating more heat than light.

You have some strong, obvious objections to both; I, as the pumpkin, am
defending the current model and implementation because it seems to
work, and agree with what the Camel III says about the matter.  If you
don't like what the Camel says, I can't help you.  If you propose
some overarching rewrite of both, I can't help you (unless you show me
the code).  I have more than enough to do in applying patches and
closing/fixing smaller bugs; I do not have the time to redesign all
that you seem to dislike.

> The &|^ disaster should have taught us this.

The &|^ disaster that is so well-known.  What are you talking about?

> > { use bytes; length() }.  unpack("U*") will croak
> > if you feed it malformed UTF-8.
> 
> There should be no 'use bytes'.

See above.
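
For readers following along, the difference between character length and
byte length mentioned above can be sketched as follows.  This is only an
illustration run against a present-day perl (an assumption; the thread
concerns 5.7.0), and the sample string is made up for the purpose.

```perl
use strict;
use warnings;

# Build a three-character string from code points, so the example does
# not depend on the encoding of this source file.
my $str = join '', map { chr } 0x263A, 0x41, 0xE9;

# Character semantics: counts characters.
my $chars = length $str;

# Byte semantics: counts the octets of the internal representation.
my $bytes = do { use bytes; length $str };

# unpack("U*") returns the code points of the string.
my @code_points = unpack 'U*', $str;

printf "chars=%d bytes=%d code points=%s\n",
    $chars, $bytes, join(',', @code_points);
```

The byte count is larger than the character count here because the
string is held internally as UTF-8, which is exactly the internal detail
that 'use bytes' exposes.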

> > > Given transparency, you do not need such a thing.
> > 
> > Wrong.  There are standards and protocols, and other pieces of
> > software, out there that *require* producing UTF-8.  IIRC LDAP
> > is one of those outside bits.  Java would be another [*].
> > Perl must be able to interface with the outside world.
> 
> Of course, but here we discuss the internal operations, not the I/O.
> Each I/O channel (including system-calls) needs to be marked by the
> translation used.

Including cases where one single I/O channel needs to carry both
8-bit data and UTF-8.
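
A minimal sketch of such a mixed channel, assuming the Encode module that
is bundled with perl 5.8 and later (it postdates this thread).  The record
layout (a 16-bit length, UTF-8 text, then raw octets) is hypothetical and
chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text    = "caf" . chr(0xE9);             # a character string
my $payload = encode('UTF-8', $text);        # its UTF-8 octets
my $blob    = pack('C*', 0xFF, 0x00, 0xE9);  # arbitrary 8-bit data

# Hypothetical layout: length-prefixed UTF-8 text followed by raw bytes,
# all travelling in one octet stream.
my $record = pack('n/a* a*', $payload, $blob);

# The reader splits the stream and decodes only the text part.
my ($got_payload, $got_blob) = unpack('n/a* a*', $record);
my $got_text = decode('UTF-8', $got_payload);

printf "record is %d octets, text has %d characters\n",
    length($record), length($got_text);
```

A single per-channel translation would not fit this stream: only part of
it is text, and the rest must pass through untouched.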

> > > ord('A') should be the same on all the systems, unless use locale or
> > 
> > It isn't.
> > In EBCDIC that produces 0xC1, or 193.
> 
> As I said, in EBCDIC you have an implicit "use locale" around your script.

Wrong.  There is no such thing in the EBCDIC implementations of Perl today.
If you are talking about something new, you are not talking about 'use locale'.
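
The platform dependence of ord('A') can be checked with a few lines.
This is only a sketch; the $platform label is a made-up name for the
illustration, and the two values tested are the ones quoted above.

```perl
use strict;
use warnings;

# ord() reports the platform's native code for the character:
# 65 (0x41) on ASCII-based systems, 193 (0xC1) on EBCDIC ones.
my $code = ord 'A';

# $platform is a hypothetical label used only for this illustration.
my $platform = $code == 65  ? 'ASCII-based'
             : $code == 193 ? 'EBCDIC'
             :                'unknown';

printf "ord('A') = %d (0x%02X), platform looks %s\n",
    $code, $code, $platform;
```

No 'use locale' is involved in either result.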

> > > somesuch is in effect.  EBCDIC ports may behave as if they have an
> > 
> > This means that you want to impose ISO Latin 1 on everyone in 8-bit world.
> 
> Unless 'use locale' is in effect.  This is exactly what we have now.

See my previous paragraph.

> > > implicit 'use locale' around each script.
> > 
> > 'use locale' has *nothing* to do with this.
> 
> You err.

See my next paragraph.

> > > [locales are just ways to assign a different cultural information to
> > >  integers (=characters).  As Larry said, Perl should allow one use
> > 
> > I wish they were -- but they are not.  That's not how they have been
> > (very weakly) defined by standards and (badly) implemented by vendors.
> > For one thing, they have very little to do with character encodings.
> 
> Here I discuss "locales as seen from Perl", not something else.

The current implementation of locales in Perl is tightly tied to the
(regrettably non-standard and broken) implementation of locales in
vendors' lib(c)s.  The locale implementation you seem to be
referring to does not exist: it is neither supported by the vendors
nor implemented in Perl, so I have a hard time commenting on
what you are saying.

If you are suggesting some new, better implementation of the locale
concept, I'm all for it; I've always seen the current implementation
by the vendors as irreparably broken, for several reasons (IBM's ICU,
for example, is very promising, once their lawyers get their act together).

But that new thing can't be 'use locale' any more, not in Perl5.

> > >  big5 for an internal cultural-info tables instead of unicode.
> > >  Similarly, 'use locale' just loads a different table into the range
> > >  0..255.  {BTW, It may make sense to make the "Unicode 0..255 range"
> > 
> > Sorry, Ilya, that's completely not what happens.
> 
> How so?  (Unless you consider collation - which is not "completely
> not" either.)

'use locale' does no "table loading into the 0..255 range".

> > > A string *must* be marked utf8 if it was utf8-encoded and contained chars
> > > above 127.  A string *may* be marked utf8 if it is byte-encoded, but does
> > > not contain chars above 127.
> > 
> > Your sentence is in opposition with our existing Unicode model and
> > implementation, which seems to be working rather nicely, so you must
> > have a complete alternative implementation in your backpocket.
> 
> Please explain how having a string marked as utf8 and with PVX="a"
> "opposes" your model.

It does not, but the problem lies in the part "might have utf8 mark if
byte-encoded, but does not contain chars above 127".  How does that
utf8 mark get in there, and how does it get cleared?  Do we set/clear
on all input strings?  We cannot set it if there are any high-bit
bytes, and we must clear it if the string gets modified and such
high-bit bytes are introduced.
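
For illustration, here is how the mark can be observed on an ASCII-only
string.  The sketch assumes perl 5.8.1 or later, where utf8::upgrade and
utf8::is_utf8 are built in; it shows that perl's behaviour, not
necessarily the 5.7.0 model under discussion.

```perl
use strict;
use warnings;

my $plain  = "a";          # byte-encoded, mark off
my $marked = "a";
utf8::upgrade($marked);    # same character, internal mark now on

my $flag_plain  = utf8::is_utf8($plain)  ? 1 : 0;
my $flag_marked = utf8::is_utf8($marked) ? 1 : 0;

# The mark is not visible to ordinary string comparison.
my $same = $plain eq $marked ? 1 : 0;

# Appending a character above 255 forces the UTF-8 representation.
my $grown      = $plain . chr(0x263A);
my $flag_grown = utf8::is_utf8($grown) ? 1 : 0;

print "plain=$flag_plain marked=$flag_marked ",
      "equal=$same grown=$flag_grown\n";
```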

> > Your sentence is essentially saying that utf8-marking is a hint (that
> > might be false) that the string might contain chars above 127,
> > instead of the current implementation where it is a guarantee of that.
> 
> > Unsurprisingly, I find the current model much cleaner.
> 
> Unsurprisingly, I do not.  You need an extra scan on each string
> operation to (sometimes) switch off utf8-bit.  Switching it off gives
> no visible semantic changes, and is quite time-consuming.
> 
> It may also significantly slow down (or significantly speed up) the
> following operations over this SV - but I would prefer to consider
> semantic changes separately from performance issues (especially for
> such unclear performance corollaries).
> 
> The "correct" model would use two bits for encoding: "PVX contains a
> sequence of byte-encoded chars", "PVX contains a sequence of
> utf8-encoded chars".  The strings with only chars 0..127 (in
> the "canonical" representation) would be marked as both.

From a quick reading this sounds reasonable and feasible, but see my first
paragraphs.  Getting Unicode anywhere near working as it is now
has taken us two years and seven months (I'm counting from 5.005_50),
and about two different models & implementations.  Now you are
proposing a third one.
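
As a rough reading of the two-bit proposal quoted above, one might sketch
it like this.  Everything here is hypothetical: mark_encoding and the
'bytes'/'utf8' keys are invented for the illustration and exist nowhere
in perl itself.

```perl
use strict;
use warnings;

# Hypothetical illustration only: none of these names exist in perl.
# $octets is the raw buffer (the PVX); $encoding says how it was produced.
sub mark_encoding {
    my ($octets, $encoding) = @_;
    die "encoding must be 'bytes' or 'utf8'\n"
        unless $encoding eq 'bytes' || $encoding eq 'utf8';

    # A buffer with only octets 0..127 reads the same either way.
    my $ascii_only = $octets !~ /[^\x00-\x7F]/ ? 1 : 0;

    return {
        bytes => ($ascii_only || $encoding eq 'bytes') ? 1 : 0,
        utf8  => ($ascii_only || $encoding eq 'utf8')  ? 1 : 0,
    };
}

my $ascii  = mark_encoding("abc",      'bytes');
my $latin1 = mark_encoding("\xE9",     'bytes');
my $utf8   = mark_encoding("\xC3\xA9", 'utf8');

for my $case ([ascii => $ascii], [latin1 => $latin1], [utf8 => $utf8]) {
    my ($name, $bits) = @$case;
    print "$name: bytes=$bits->{bytes} utf8=$bits->{utf8}\n";
}
```

Only the ASCII-only buffer ends up with both bits set, which matches the
"marked as both" case in the proposal.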

> In the current model the flag is *used* to distinguish things which
> need some massage when converting to byte-strings and utf8-strings.
> It does not make a lot of sense to have the "informal meaning" of the
> flag so distant from "the meaning of the flag when used".
> 
> Ilya


-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen


