Ilya Zakharevich <ilya@math.ohio-state.edu> writes:
>On Wed, Feb 21, 2001 at 12:45:07AM +0000, Simon Cozens wrote:
>Can you please remove LatinX from your description. It confuses me...

Assume he meant latin1/iso-8859-1 - the "cultural info" attached to the
numbers would be different in other locales.

>Do you mean "any locale"?
>
>> The only spanner in the, um, works is v-strings.
>
>v-thingies are one large problem anyway. I do not have a slightest
>idea *why* such an abomination made it into Perl...
>
>> The problem with v-strings is
>> that they expect the Unicode code point x to be the same as chr(x), which
>> isn't the case for EBCDIC, because the lower 255 codepoints are *not* the same
>> as EBCDIC and they are for Latin 1.
>
>Nope. It is not that you break v-thingies. You broke the fundamental
>relationship that ord() is transparent w.r.t. byte/utf8

The thing to understand about EBCDIC perl in the Simon model is that the
bottom 256 numbers have been transformed. It does not use Unicode code
points but a different space which has a one-to-one mapping to that space.
The transparency is retained but in that different space.

>transmogrifations. This is a no-no-no.
>
>The solution is as I proposed. I repeat it:
>
>  'use locale' (or working on a EBCDIC machine) switches the table of
>  cultural info associated to integers in the range 0..255.

That is essentially what Simon's scheme does - we are all in
"violent agreement" again ;-)

>
>That's all. [Well, if you use big-5 locale, then you need to switch
>things in the larger region...]
>
>The only problem with this is how to reuse existing (??? do they exist
>already?) i/o filters which assume translation-to-Unicode. Two things
>are needed:
>
>  a) knowledge how to translate locale->Unicode (so recognition of
>     which Unicode points move into 0..255 rage);

The a2e/e2a tables just permute the 0..255 range, they don't add/remove
any points. Outside that range that transform is an identity.
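To make the "permute 0..255, identity elsewhere" point concrete, here is a rough sketch outside perl, in Python, using its cp037 codec as a stand-in EBCDIC code page. Which code page a given EBCDIC perl actually uses is not stated here, so cp037 is an assumption, and the names E2A, A2E, to_native and to_unicode are made up for illustration; they are not perl's internal tables.

```python
# Sketch only: cp037 is an assumed EBCDIC code page; all names are hypothetical.

# E2A[b] = Unicode (Latin-1) code point for native EBCDIC byte b.
E2A = [ord(ch) for ch in bytes(range(256)).decode("cp037")]

# A2E is the inverse permutation: Unicode code point -> native byte.
A2E = [0] * 256
for native_byte, code_point in enumerate(E2A):
    A2E[code_point] = native_byte


def to_native(code_point: int) -> int:
    """Map a Unicode code point into the native space (identity above 255)."""
    return A2E[code_point] if code_point < 256 else code_point


def to_unicode(native: int) -> int:
    """Map a native ordinal back to Unicode (identity above 255)."""
    return E2A[native] if native < 256 else native


# Both tables are permutations of 0..255: no points added or removed.
assert sorted(A2E) == list(range(256))
assert sorted(E2A) == list(range(256))

print(hex(to_native(0x41)))    # 0xc1, the EBCDIC byte for 'A'
print(hex(to_native(0x263A)))  # 0x263a, unchanged outside 0..255
```

Because the mapping is a bijection on 0..255 and the identity above that, a round trip through either direction loses nothing, which is the sense in which the transparency survives in the transformed space.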
So the transparency is achieved by having two representations (as ever)

  byte - cultural info from native EBCDIC
  utf8 - transform and then hold as UTF-8 Unicode - cultural info from
         Unicode db.

>
>  b) a way to reach Unicode points which were in 0..255, but are no more;

pack('C',...) gives access to the EBCDIC-space (locale space if you must)
pack('U',...) gives access to the Unicode space

Thus pack('U',0x41) can be held "transparently" as

  0x41,SvUTF8_on   (Upper-case-'A'-ness from Unicode)
  0xC1,SvUTF8_off  (Upper-case-'A'-ness from EBCDIC)

>
>(a) is needed anyway for non-use-locale i/o filters, and to solve (b)
>I propose to "duplicate" the whole Unicode set outside of UTF-8 range
>(but inside utf8 range), say, starting at 80000000.
>
>Ilya
-- 
Nick Ing-Simmons <nik@tiuk.ti.com>
Via, but not speaking for: Texas Instruments Ltd.