perl.perl5.porters | Postings from February 2001

Re: Perl-Unicode fundamentals (was Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2))

Ilya Zakharevich
February 20, 2001 21:43
On Wed, Feb 21, 2001 at 12:45:07AM +0000, Simon Cozens wrote:
> Perl, on most non-EBCDIC platforms, happily assumes that the world is Latin1.

Unless 'use locale'.

> I want to extend this idea to EBCDIC. If you throw around a bunch of EBCDIC
> strings, fine. You don't need to care about that, and Perl will continue to
> operate in the way that it always has done. If you introduce a Unicode string
> into the equation, then things get tricky. Just like with LatinX, Perl will
> upgrade that string to Unicode, passing it through a filter which turns EBCDIC
> code points into Unicode code points.

Again, this is a particular case of 'use locale' situation.

> You can see the parallel? It's very easy. If the LatinX model works, then the
> EBCDIC model works.

Can you please remove LatinX from your description.  It confuses me...
Do you mean "any locale"?

> The only spanner in the, um, works is v-strings.

v-thingies are one large problem anyway.  I do not have the slightest
idea *why* such an abomination made it into Perl...

> The problem with v-strings is
> that they expect the Unicode code point x to be the same as chr(x), which
> isn't the case for EBCDIC, because the lower 255 codepoints are *not* the same
> as EBCDIC and they are for Latin 1.

Nope.  It is not that you break v-thingies.  You break the fundamental
relationship that ord() is transparent w.r.t. byte/utf8
transmogrifications.  This is a no-no-no.
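To make the invariant concrete, here is a sketch in Python rather than Perl,
with code page 037 standing in for "an EBCDIC machine" (the choice of code
page is illustrative): ord("A") on EBCDIC returns 0xC1, and it must keep
returning 0xC1 after the string is upgraded; an upgrade that translates code
points changes the answer to 0x41.

```python
# Sketch (Python, not Perl; cp037 stands in for an EBCDIC platform):
# ord() must return the same value before and after a byte -> utf8
# upgrade.  Translating code points during the upgrade breaks that.
EBCDIC_A = b'\xc1'                            # 'A' as a native EBCDIC byte
native_ord = EBCDIC_A[0]                      # ord() before upgrade: 0xC1
unicode_ord = ord(EBCDIC_A.decode('cp037'))   # ord() after a translating
                                              # upgrade: U+0041 = 0x41
assert native_ord != unicode_ord              # the invariant is broken
```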

The solution is as I proposed.  I repeat it:

  'use locale' (or working on an EBCDIC machine) switches the table of
  cultural info associated with integers in the range 0..255.

That's all.  [Well, if you use big-5 locale, then you need to switch
things in the larger region...]
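The table-switching can be sketched like this (again in Python rather than
Perl, with code pages as illustrative stand-ins for locale tables): the same
integer 0xE9 carries different cultural info depending on which table is in
effect.

```python
def uc_byte(b, codepage):
    """Uppercase one native byte via the given code page's table -- the
    'table of cultural info' associated with the integers 0..255."""
    ch = bytes([b]).decode(codepage)          # what this byte *means* here
    return ch.upper().encode(codepage)[0]     # its uppercase, as a byte

# The same integer 0xE9 means different things under different tables:
assert uc_byte(0xE9, 'latin-1') == 0xC9       # 0xE9 is e-acute -> E-acute
assert uc_byte(0xE9, 'cp037') == 0xE9         # 0xE9 is already 'Z' in EBCDIC
```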

The only problem with this is how to reuse existing (??? do they exist
already?) i/o filters which assume translation-to-Unicode.  Two things
are needed:

 a) knowledge of how to translate locale->Unicode (i.e. recognition of
    which Unicode points map into the 0..255 range);

 b) a way to reach Unicode points which were in 0..255, but are no
    longer;

(a) is needed anyway for non-use-locale i/o filters, and to solve (b)
I propose to "duplicate" the whole Unicode set outside of the UTF-8
range (but inside the utf8 range), say, starting at 0x80000000.
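A sketch of the resulting addressing scheme (Python, not Perl; cp437 is an
arbitrary stand-in for a locale whose 0..255 table displaces some Latin-1
points, and PRIVATE_BASE is one reading of the proposed 0x80000000 start --
the exact layout is an assumption):

```python
PRIVATE_BASE = 0x80000000   # assumed start of the "duplicated" set

# (a) locale -> Unicode: which Unicode point each native byte denotes.
# cp437 displaces some of 0..255 (it has Greek alpha, but no copyright sign).
byte_to_unicode = {b: ord(bytes([b]).decode('cp437')) for b in range(256)}
reachable = set(byte_to_unicode.values())

def codepoint_address(cp):
    """Address a Unicode code point under the sketched scheme."""
    if cp in reachable:
        # still representable as a single native byte
        return next(b for b, u in byte_to_unicode.items() if u == cp)
    if cp < 256:
        # (b) displaced out of 0..255: reach it in the duplicated region
        return PRIVATE_BASE + cp
    return cp               # ordinary high code point, unchanged

assert codepoint_address(0x3B1) == 0xE0                # alpha: native byte
assert codepoint_address(0xA9) == PRIVATE_BASE + 0xA9  # (c) sign: displaced
```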

Ilya