Simon Cozens <simon@simon-cozens.org> writes:
>OK. Let me try and finally explain what I propose to do with EBCDIC.
>
>Perl, on most non-EBCDIC platforms, happily assumes that the world is Latin1.
>Or LatinX - it doesn't matter. It only becomes significant when Unicode
>strings are introduced into a Perl program. When LatinX and Unicode strings
>meet, Perl assumes that the non-Unicode string is Latin1 and upgrades it to
>Unicode. If it isn't Latin1, then we have another problem which can be solved
>another time, and probably with :encode(LatinX). However, if we don't
>introduce Unicode strings into the equation, then LatinX can continue being
>LatinX and people don't actually need to care that LatinX is not the first
>255 characters of the Unicode standard, that is, Latin 1.
>
>I want to extend this idea to EBCDIC. If you throw around a bunch of EBCDIC
>strings, fine. You don't need to care about that, and Perl will continue to
>operate in the way that it always has done. If you introduce a Unicode string
>into the equation, then things get tricky. Just like with LatinX, Perl will
>upgrade that string to Unicode, passing it through a filter which turns EBCDIC
>code points into Unicode code points. Then you have a bunch of Unicode
>strings, and you're back to the model above. No problem. So you have:
>
>    LatinX codepoints + LatinX  -> LatinX
>    LatinX codepoints + Unicode -> Upgrade LatinX (as Latin1) to Unicode
>    EBCDIC codepoints + EBCDIC  -> EBCDIC
>    EBCDIC codepoints + Unicode -> Upgrade EBCDIC (via filter) to Unicode
>
>You can see the parallel? It's very easy. If the LatinX model works, then the
>EBCDIC model works.

That has been my assumption recently - that is:

chr(0)..chr(255)
  - 'byte-able'
  - has EBCDIC "cultural info" when SvUTF8_off (isalpha, tolower etc.)
  - to upgrade do e2a[ch] and SvUTF8_on
    e2a array is equivalent to ext/Encode/Encode/cp1047.ucm
  - e.g. chr(0xC1) can be C 0xC1,SvUTF8_off or 0x41,SvUTF8_on
chr(256)...
  - only as UTF8
  - uses Unicode code points.

Said yet another way - this is still "transparent" in the Ilya sense,
it is just that the semantics of the numbers 0..255 are "scrambled"
compared to Unicode code points.

I am also assuming that pack('U',...) implies Unicode code points,
while pack('C',...) has legacy EBCDIC nature (to match ord/chr),
so that on EBCDIC

  pack('U',0x41) eq pack('C',0xC1)

With the above definition Encode can do its thing on EBCDIC by calling
sv_utf8_upgrade() and then proceeding to index the tables with the resulting
UTF-8 encoded bytes - which is what it normally does.

There is an internals feature "hidden" in the above which has presumably
been fixed by now. When we upgrade 0xC1 we get 0x41 - with no high bits -
but we must still set SvUTF8_on.

>
>The only spanner in the, um, works is v-strings. The problem with v-strings is
>that they expect the Unicode code point x to be the same as chr(x), which
>isn't the case for EBCDIC, because the lower 255 codepoints are *not* the same
>as EBCDIC and they are for Latin 1. Hence v5.6.0 means something different on
>EBCDIC as it does on LatinX.

v-strings are new. They have no legacy, so we can define them to mean
whatever we like - we just need to decide what we like.
As we are defining the works we can define them "to have a spanner just here".
With luck we can make the spanner look like a useful lever...

You seem to be implying that the identity we "like" is:

  v5.6.0 eq pack('C',5,6,0)

Perhaps we change our minds and make that what I thought it was, i.e.

  v5.6.0 eq pack('U',5,6,0)

Given transparency the two are identical on Latin1, so we have not broken
anything there.

We need to understand what we want v-strings _for_ and what Camel-III
or whatever has said about them.

For example: if something has said that v127.0.0.1 is passable to socket as
'localhost', we may need an sv_downgrade_latin1() which does NOT run the
result through a2e[], and extend "transparency" to socket() by calling
that there.

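The upgrade rule and the two candidate v-string identities above can be tried out with a small model. This is only a sketch in Python, not Perl's internals: the SV class, upgrade(), concat(), pack_C(), pack_U() and sv_eq() are hypothetical names made up for illustration. Python's standard library has no cp1047 codec, so cp037 stands in for the e2a[] table; both code pages put 'A' at 0xC1, though they differ on some punctuation.

```python
# Toy model of the upgrade rule discussed above; NOT Perl's implementation.
# Assumption: cp037 stands in for cp1047 (no cp1047 codec in Python's stdlib).
from dataclasses import dataclass

# e2a[]: native EBCDIC number -> Unicode code point
E2A = [ord(bytes([b]).decode("cp037")) for b in range(256)]

@dataclass(frozen=True)
class SV:
    codes: tuple  # one number per character
    utf8: bool    # False: native EBCDIC numbers, True: Unicode code points

def upgrade(sv):
    """Run a non-UTF8 string through e2a[] and set the flag (SvUTF8_on)."""
    if sv.utf8:
        return sv
    return SV(tuple(E2A[c] for c in sv.codes), True)

def concat(a, b):
    """EBCDIC + EBCDIC stays EBCDIC; anything + Unicode upgrades first."""
    if not a.utf8 and not b.utf8:
        return SV(a.codes + b.codes, False)
    a, b = upgrade(a), upgrade(b)
    return SV(a.codes + b.codes, True)

def pack_C(*nums):
    """Models pack('C',...): numbers taken as native (EBCDIC) characters."""
    return SV(tuple(nums), False)

def pack_U(*nums):
    """Models pack('U',...): numbers taken as Unicode code points."""
    return SV(tuple(nums), True)

def sv_eq(a, b):
    """Models 'eq': compare after bringing both sides to Unicode."""
    return upgrade(a).codes == upgrade(b).codes

# pack('U',0x41) eq pack('C',0xC1) holds in this model
print(sv_eq(pack_U(0x41), pack_C(0xC1)))
# the two v5.6.0 definitions are different strings in this model
print(sv_eq(pack_C(5, 6, 0), pack_U(5, 6, 0)))
```

Note that the upgraded 0xC1 becomes 0x41 with the flag set even though no high bit is present, which is the "hidden" internals point. If E2A were the identity table, as it effectively is for Latin1, the two v-string definitions would compare equal.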
>This is basically what I'm trying to fix when I
>get my access to an EBCDIC machine - distinguishing between those functions
>which use Unicode for numbers and for strings.
>
>I think that's about it.

-- 
Nick Ing-Simmons <nik@tiuk.ti.com>
Via, but not speaking for: Texas Instruments Ltd.