
Re: Perl-Unicode fundamentals (was Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2))

From: Simon Cozens
Date: February 20, 2001 16:45
Subject: Re: Perl-Unicode fundamentals (was Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2))
Message ID: 20010221004507.B9430@pembro26.pmb.ox.ac.uk

On Tue, Feb 20, 2001 at 09:53:09PM +0000, nick@ing-simmons.net wrote:
> The big question mark is what we (well "they" actually) do on EBCDIC 
> platforms where it has been demonstrated that ord('A') == 0xC1 is 
> a requirement (if only because it is used as a test for "this is an EBCDIC 
> platform").  Simon and Peter have made much progress in this area
> but they have not fully explained it yet.

OK. Let me try and finally explain what I propose to do with EBCDIC.

Perl, on most non-EBCDIC platforms, happily assumes that the world is Latin1.
Or LatinX - it doesn't matter. It only becomes significant when Unicode
strings are introduced into a Perl program. When LatinX and Unicode strings
meet, Perl assumes that the non-Unicode string is Latin1 and upgrades it to
Unicode. If it isn't Latin1, then we have another problem which can be solved
another time, and probably with :encode(LatinX). However, if we don't
introduce Unicode strings into the equation, then LatinX can continue being
LatinX and people don't actually need to care that LatinX is not the first
256 characters of the Unicode standard, that is, Latin 1.
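
To make the upgrade rule concrete, here is a minimal sketch. It is written in Python rather than Perl purely because Python's standard codecs make the byte-to-code-point step explicit; it is not how Perl implements anything, and the helper name `upgrade_as_latin1` is made up for the illustration. It shows that "upgrade as Latin1" just reuses each byte value as a Unicode code point, which is right for real Latin 1 data and wrong for, say, Latin 2 data.

```python
def upgrade_as_latin1(native: bytes) -> str:
    """Hypothetical helper: treat each byte as the Unicode code point
    with the same number, which is exactly what decoding Latin-1 does."""
    return native.decode("latin-1")


# Bytes that really are Latin 1 survive the upgrade intact.
latin1_bytes = "café".encode("latin-1")
latin1_upgraded = upgrade_as_latin1(latin1_bytes)

# Bytes that are really Latin 2 (ISO 8859-2) get upgraded to the wrong
# characters, because the upgrade assumes Latin 1.
latin2_bytes = "łódź".encode("iso8859-2")
latin2_upgraded = upgrade_as_latin1(latin2_bytes)

# In both cases the code points are simply the original byte values.
same_numbers = [ord(ch) for ch in latin2_upgraded] == list(latin2_bytes)
```

As long as `latin2_bytes` never meets a Unicode string, nothing is upgraded and the mismatch never shows up, which is the point of the paragraph above.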

I want to extend this idea to EBCDIC. If you throw around a bunch of EBCDIC
strings, fine. You don't need to care about that, and Perl will continue to
operate in the way that it always has done. If you introduce a Unicode string
into the equation, then things get tricky. Just like with LatinX, Perl will
upgrade the EBCDIC string to Unicode, passing it through a filter which turns EBCDIC
code points into Unicode code points. Then you have a bunch of Unicode
strings, and you're back to the model above. No problem. So you have:

    LatinX codepoints + LatinX  -> LatinX
    LatinX codepoints + Unicode -> Upgrade LatinX (as Latin1) to Unicode
    EBCDIC codepoints + EBCDIC  -> EBCDIC
    EBCDIC codepoints + Unicode -> Upgrade EBCDIC (via filter) to Unicode

You can see the parallel? It's very easy. If the LatinX model works, then the
EBCDIC model works.
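
The four cases above can be sketched as one small function. This is again Python, not Perl, and only a model of the proposal under stated assumptions: `bytes` stands in for a non-Unicode Perl string, `str` for a Unicode one, the names `upgrade` and `concat` are hypothetical, and code page 037 (`cp037`) is picked arbitrarily as "the" EBCDIC, although real EBCDIC platforms use several code pages.

```python
from typing import Union

Text = Union[bytes, str]  # bytes = native string, str = Unicode string

LATIN1 = "latin-1"
EBCDIC = "cp037"  # assumption: one particular EBCDIC code page


def upgrade(native: bytes, native_encoding: str) -> str:
    """The 'filter': map native code points to Unicode code points.
    For Latin 1 this is the identity mapping on the numbers."""
    return native.decode(native_encoding)


def concat(a: Text, b: Text, native_encoding: str) -> Text:
    """Join two strings, upgrading only when a Unicode string is involved."""
    if isinstance(a, bytes) and isinstance(b, bytes):
        return a + b  # native + native stays native, no conversion at all
    if isinstance(a, bytes):
        a = upgrade(a, native_encoding)
    if isinstance(b, bytes):
        b = upgrade(b, native_encoding)
    return a + b


# EBCDIC + EBCDIC stays EBCDIC: 0xC1 0xC2 is "AB" in EBCDIC, untouched.
native_only = concat(b"\xc1", b"\xc2", EBCDIC)

# EBCDIC + Unicode: the EBCDIC side goes through the filter first.
mixed_ebcdic = concat(b"\xc1", "\u263a", EBCDIC)

# LatinX + Unicode: same code path, with the Latin 1 (identity) filter.
mixed_latin = concat(b"A", "\u263a", LATIN1)
```

The only difference between the LatinX rows and the EBCDIC rows is which table `upgrade` uses, which is the parallel being drawn.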

The only spanner in the, um, works is v-strings. The problem with v-strings is
that they expect the Unicode code point x to be the same as chr(x). That holds
for Latin 1, whose 256 code points are the lowest 256 code points of Unicode,
but it isn't the case for EBCDIC. Hence v5.6.0 means something different on
EBCDIC than it does on LatinX. This is basically what I'm trying to fix once I
get access to an EBCDIC machine - distinguishing between those functions
which use Unicode for numbers and those which use it for strings.
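
A rough illustration of the v-string mismatch, with the same caveats as before: this is Python modelling the idea, not Perl's v-string code, both function names are invented, and `cp037` is an assumed stand-in for the platform's EBCDIC code page. The numbers 65 and 66 are used instead of 5.6.0 only because the result is printable.

```python
EBCDIC = "cp037"  # assumption: one particular EBCDIC code page


def vstring_as_code_points(*numbers: int) -> str:
    """Read v65.66 as Unicode code points: each number names a character."""
    return "".join(chr(n) for n in numbers)


def vstring_as_native(*numbers: int, native_encoding: str) -> str:
    """Read the same numbers as native byte values (a native chr(x)),
    then upgrade the result to Unicode for comparison."""
    return bytes(numbers).decode(native_encoding)


as_unicode = vstring_as_code_points(65, 66)
on_latin1 = vstring_as_native(65, 66, native_encoding="latin-1")
on_ebcdic = vstring_as_native(65, 66, native_encoding=EBCDIC)

# On a Latin 1 platform the two readings agree, so nobody notices.
agree_on_latin1 = as_unicode == on_latin1

# On EBCDIC they do not: 'A' lives at 0xC1 there, not at 65.
agree_on_ebcdic = as_unicode == on_ebcdic
ebcdic_A = "A".encode(EBCDIC)
```

Which of the two readings a given function should use is the "numbers versus strings" distinction mentioned above.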

I think that's about it.
-- 
A witty saying means nothing.  -Voltaire


