
Re: Perl-Unicode fundamentals (was Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2))

Nick Ing-Simmons
February 21, 2001 02:02
Simon Cozens <> writes:
>OK. Let me try and finally explain what I propose to do with EBCDIC.
>Perl, on most non-EBCDIC platforms, happily assumes that the world is Latin1.
>Or LatinX - it doesn't matter. It only becomes significant when Unicode
>strings are introduced into a Perl program. When LatinX and Unicode strings
>meet, Perl assumes that the non-Unicode string is Latin1 and upgrades it to
>Unicode. If it isn't Latin1, then we have another problem which can be solved
>another time, and probably with :encode(LatinX). However, if we don't
>introduce Unicode strings into the equation, then LatinX can continue being
>LatinX and people don't actually need to care that LatinX is not the first
>255 characters of the Unicode standard, that is, Latin 1.
>I want to extend this idea to EBCDIC. If you throw around a bunch of EBCDIC
>strings, fine. You don't need to care about that, and Perl will continue to
>operate in the way that it always has done. If you introduce a Unicode string
>into the equation, then things get tricky. Just like with LatinX, Perl will
>upgrade that string to Unicode, passing it through a filter which turns EBCDIC
>code points into Unicode code points. Then you have a bunch of Unicode
>strings, and you're back to the model above. No problem. So you have:
>    LatinX codepoints + LatinX  -> LatinX
>    LatinX codepoints + Unicode -> Upgrade LatinX (as Latin1) to Unicode
>    EBCDIC codepoints + EBCDIC  -> EBCDIC
>    EBCDIC codepoints + Unicode -> Upgrade EBCDIC (via filter) to Unicode
>You can see the parallel? It's very easy. If the LatinX model works, then the
>EBCDIC model works.

That has been my assumption recently - that is:

  chr(0)..chr(255) - 'byte-able' has EBCDIC "cultural info" when SvUTF8_off
                     (isalpha, tolower etc.)
                   - to upgrade do e2a[ch] and SvUTF8_on
                     e2a array is equivalent to 
                   - e.g. chr(0xC1) can be C 0xC1,SvUTF8_off
                     or  0x41,SvUTF8_on
  chr(256)...      - only as UTF8 - uses Unicode code points.

Said yet another way - this is still "transparent" in the Ilya sense,
it is just that the semantics of the numbers 0..255 are "scrambled" 
compared to Unicode code points.
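
A minimal sketch of the model above, written in Python purely as an illustration; it is not Perl's implementation. Assumptions: code page 037 stands in for "EBCDIC", Python's cp037 codec plays the role of e2a[], and the SV class and the upgrade/concat helpers are hypothetical names.

```python
from dataclasses import dataclass

NATIVE = "cp037"  # assumption: code page 037 stands in for "EBCDIC"


@dataclass(frozen=True)
class SV:
    """Toy scalar: raw bytes plus a flag modelling SvUTF8."""
    data: bytes
    utf8: bool = False


def upgrade(sv: SV) -> SV:
    """Model of the upgrade: map native bytes through e2a, then set the flag."""
    if sv.utf8:
        return sv
    text = sv.data.decode(NATIVE)           # the e2a[ch] step
    return SV(text.encode("utf-8"), True)   # the SvUTF8_on step


def concat(a: SV, b: SV) -> SV:
    """Native + native stays native; any Unicode operand upgrades both."""
    if not a.utf8 and not b.utf8:
        return SV(a.data + b.data, False)
    a, b = upgrade(a), upgrade(b)
    return SV(a.data + b.data, True)


ebcdic_A = SV(b"\xC1")                       # 'A' as an EBCDIC byte
smiley = SV("\u263A".encode("utf-8"), True)  # chr(0x263A), only as UTF-8

print(concat(ebcdic_A, ebcdic_A))  # bytes untouched, flag still off
print(concat(ebcdic_A, smiley))    # 0xC1 became 0x41, flag on
```

The point of the sketch is that the pure-EBCDIC path never touches the table; only mixing with a flagged string triggers the mapping.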

I am also assuming that pack('U',...) implies Unicode code points, 
while pack('C',...) has legacy EBCDIC nature (to match ord/chr),
so that on EBCDIC

         pack('U',0x41) eq pack('C',0xC1)
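
As a toy check of that identity, here is a Python model; pack_U and pack_C are hypothetical stand-ins for the two pack templates, and cp037 is assumed as the EBCDIC code page. Both return the abstract character, so the comparison is character-wise, as with eq.

```python
def pack_U(code_point: int) -> str:
    """Stand-in for pack('U', n): n is a Unicode code point."""
    return chr(code_point)


def pack_C(byte: int, native: str) -> str:
    """Stand-in for pack('C', n): n is a byte in the platform's native encoding."""
    return bytes([byte]).decode(native)


# On an EBCDIC platform (cp037 assumed), 'A' is byte 0xC1.
print(pack_U(0x41) == pack_C(0xC1, "cp037"))    # True
# On a Latin-1 platform the two numbers coincide.
print(pack_U(0x41) == pack_C(0x41, "latin-1"))  # True
```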

With the above definition Encode can do its thing on EBCDIC by calling 
sv_utf8_upgrade() and then proceeding to index the tables with resulting 
UTF-8 encoded bytes - which is what it normally does.

There is an internals feature "hidden" in the above which has presumably 
been fixed by now. When we upgrade 0xC1 we get 0x41 - with no high bits 
- but we must still set SvUTF8_on.
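
To illustrate why the flag still matters, a small Python model (cp037 assumed for EBCDIC; the characters helper is a hypothetical name, not a Perl internal). The upgraded byte 0x41 has no high bit, so only the flag tells the two readings apart.

```python
def characters(data: bytes, utf8_flag: bool, native: str = "cp037") -> str:
    """Interpret a byte buffer the way the model above would."""
    return data.decode("utf-8" if utf8_flag else native)


upgraded = b"\x41"  # what 0xC1 becomes after e2a[]

print(characters(upgraded, True) == "A")   # True: flag on, 0x41 is Unicode 'A'
print(characters(upgraded, False) == "A")  # False: flag off, 0x41 is read as EBCDIC
```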

>The only spanner in the, um, works is v-strings. The problem with v-strings is
>that they expect the Unicode code point x to be the same as chr(x), which
>isn't the case for EBCDIC, because the lower 255 codepoints are *not* the same
>as EBCDIC and they are for Latin 1. Hence v5.6.0 means something different on
>EBCDIC than it does on LatinX.

v-strings are new. They have no legacy, we can define them to mean 
whatever we like - we just need to decide what we like.
As we are defining the works we can define them "to have a spanner just here".
With luck we can make spanner look like a useful lever...

You seem to be implying that the identity we "like" is:

    v5.6.0 eq pack('C',5,6,0)

Perhaps we change our minds and make it what I thought it was, i.e.

    v5.6.0 eq pack('U',5,6,0)

Given transparency the two are identical on Latin1 so we have not broken
anything there.
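
The two candidate definitions can be compared in a short Python model (again only a sketch: vstring_as_C and vstring_as_U are hypothetical helpers, and cp037 is assumed for EBCDIC):

```python
def vstring_as_C(parts, native):
    """v-string defined via pack('C', ...): parts are native byte values."""
    return bytes(parts).decode(native)


def vstring_as_U(parts):
    """v-string defined via pack('U', ...): parts are Unicode code points."""
    return "".join(chr(p) for p in parts)


v = (5, 6, 0)  # v5.6.0

print(vstring_as_C(v, "latin-1") == vstring_as_U(v))  # True: no difference on Latin1
print(vstring_as_C(v, "cp037") == vstring_as_U(v))    # False: the choice matters on EBCDIC
```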

We need to understand what we want v-strings _for_ and what Camel-III 
or whatever has said about them. 

For example:
If something has said that v127.0.0.1 is passable to socket as 'localhost'
we may need an sv_downgrade_latin1() which does NOT run the result through a2e[]
and extend "transparency" to socket() by calling that there.
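
A rough Python illustration of the difference such a downgrade would make, assuming the pack('U',...) definition of v-strings and cp037 as the EBCDIC code page. Encoding as Latin-1 models the proposed sv_downgrade_latin1(); encoding as cp037 models a downgrade that goes through a2e[].

```python
import socket

# v127.0.0.1 under the pack('U', ...) definition: four Unicode code points.
v = "".join(chr(n) for n in (127, 0, 0, 1))

wanted = socket.inet_aton("127.0.0.1")  # the packed address: bytes 127, 0, 0, 1

# Downgrade as Latin-1 (no a2e[] step): code point n becomes byte n.
print(v.encode("latin-1") == wanted)  # True

# Downgrade through a2e[] (cp037 assumed): the bytes get "scrambled".
print(v.encode("cp037") == wanted)    # False
```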

>This is basically what I'm trying to fix when I
>get my access to an EBCDIC machine - distinguishing between those functions
>which use Unicode for numbers and for strings.
>I think that's about it.
Nick Ing-Simmons <>
Via, but not speaking for: Texas Instruments Ltd.