develooper Front page | perl.perl5.porters | Postings from February 2001

Re: Perl-Unicode fundamentals (was Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2))

From: Nick Ing-Simmons
Date: February 21, 2001 02:23
Subject: Re: Perl-Unicode fundamentals (was Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2))
Message ID: 200102211021.KAA27043@mikado.tiuk.ti.com

Ilya Zakharevich <ilya@math.ohio-state.edu> writes:
>On Wed, Feb 21, 2001 at 12:45:07AM +0000, Simon Cozens wrote:
>Can you please remove LatinX from your description.  It confuses me...

I assume he meant Latin-1 (ISO-8859-1); the "cultural info" attached
to the numbers would be different in other locales.

>Do you mean "any locale"?
>
>> The only spanner in the, um, works is v-strings.
>
>v-thingies are one large problem anyway.  I do not have a slightest
>idea *why* such an abomination made it into Perl...
>
>> The problem with v-strings is
>> that they expect the Unicode code point x to be the same as chr(x), which
>> isn't the case for EBCDIC, because the lower 255 codepoints are *not* the same
>> as EBCDIC and they are for Latin 1.
>
>Nope.  It is not that you break v-thingies.  You broke the fundamental
>relationship that ord() is transparent w.r.t. byte/utf8

The thing to understand about EBCDIC perl in the Simon model is that
the bottom 256 numbers have been transformed. It does not use Unicode
code points but a different space which has a one-to-one mapping to
the Unicode space. The transparency is retained, but in that different
space.

>transmogrifications.  This is a no-no-no.
>
>The solution is as I proposed.  I repeat it:
>
>  'use locale' (or working on an EBCDIC machine) switches the table of
>  cultural info associated to integers in the range 0..255.

That is essentially what Simon's scheme does - we are all in 
"violent agreement" again ;-)

>
>That's all.  [Well, if you use big-5 locale, then you need to switch
>things in the larger region...]
>
>The only problem with this is how to reuse existing (??? do they exist
>already?) i/o filters which assume translation-to-Unicode.  Two things
>are needed:

>
> a) knowledge how to translate locale->Unicode (so recognition of
>    which Unicode points move into 0..255 range);

The a2e/e2a tables just permute the 0..255 range; they don't add or
remove any points. Outside that range the transform is the identity.
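
A minimal sketch of that permutation, written in Python rather than
Perl so that it behaves the same on an ASCII host. It assumes code
page 037 as the EBCDIC variant (the thread does not name one) and
takes the mapping from Python's built-in cp037 codec. The names e2a,
a2e, to_native and to_unicode are illustrative only and are not
perl's internal tables.

```python
# Assumption: EBCDIC means code page 037; the thread names no code page.
# e2a[e] is the Unicode (Latin-1) code point for EBCDIC byte e.
e2a = [ord(bytes([e]).decode("cp037")) for e in range(256)]

# a2e is the inverse table: Unicode code point -> EBCDIC byte.
a2e = [0] * 256
for ebcdic, code_point in enumerate(e2a):
    a2e[code_point] = ebcdic


def to_native(code_point):
    """Unicode space -> native space: permute 0..255, identity above."""
    return a2e[code_point] if code_point < 256 else code_point


def to_unicode(native):
    """Native space -> Unicode space: the inverse permutation."""
    return e2a[native] if native < 256 else native


# The table is a pure permutation: every value 0..255 appears exactly once.
is_permutation = sorted(e2a) == list(range(256))
```

With these tables to_native(0x41) is 0xC1, and any value above 255
passes through unchanged in both directions.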

So the transparency is achieved by having two representations (as ever):

   byte - held as native EBCDIC; cultural info from native EBCDIC
   utf8 - transformed, then held as UTF-8 Unicode; cultural info from
          the Unicode db.

>
> b) a way to reach Unicode points which were in 0..255, but are no more;

   pack('C',...) gives access to the EBCDIC-space  (locale space if you must)
   pack('U',...) gives access to the Unicode space

Thus pack('U',0x41) can be held "transparently" as either of:

   0x41,SvUTF8_on      (Upper-case-'A'-ness from Unicode)
   0xC1,SvUTF8_off     (Upper-case-'A'-ness from EBCDIC)
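
The two forms above can be modelled as follows. This is a toy model
in Python, not perl's implementation: Scalar, pack_U, pack_C,
native_ord and downgrade are hypothetical names, code page 037 is
again an assumed EBCDIC variant, and having ord() answer in the
native space is one reading of the scheme described in this thread.

```python
from dataclasses import dataclass

# Assumption: code page 037 stands in for "EBCDIC".
E2A = [ord(bytes([e]).decode("cp037")) for e in range(256)]
A2E = [0] * 256
for native, code_point in enumerate(E2A):
    A2E[code_point] = native


@dataclass(frozen=True)
class Scalar:
    octets: bytes  # the stored bytes
    utf8: bool     # stands in for SvUTF8_on / SvUTF8_off


def pack_U(code_point):
    """Unicode-space entry: hold the code point as UTF-8, flag on."""
    return Scalar(chr(code_point).encode("utf-8"), True)


def pack_C(native):
    """Native-space entry: hold the EBCDIC byte as is, flag off."""
    return Scalar(bytes([native]), False)


def native_ord(s):
    """First character as a native-space number, whatever the form."""
    if not s.utf8:
        return s.octets[0]
    code_point = ord(s.octets.decode("utf-8")[0])
    return A2E[code_point] if code_point < 256 else code_point


def downgrade(s):
    """Convert a one-character utf8 scalar to byte form, if it fits."""
    if not s.utf8:
        return s
    native = native_ord(s)
    if native > 255:
        raise ValueError("character does not fit in a byte")
    return pack_C(native)


a_utf8 = pack_U(0x41)   # octets 0x41, flag on
a_byte = pack_C(0xC1)   # octets 0xC1, flag off
```

In this model downgrade(a_utf8) equals a_byte, and native_ord gives
0xC1 for both, which is the transparency being claimed: the number
does not depend on which form the string happens to be held in.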


>
>(a) is needed anyway for non-use-locale i/o filters, and to solve (b)
>I propose to "duplicate" the whole Unicode set outside of UTF-8 range
>(but inside utf8 range), say, starting at 80000000.
>
>Ilya
-- 
Nick Ing-Simmons <nik@tiuk.ti.com>
Via, but not speaking for: Texas Instruments Ltd.



