Re: Perl-Unicode fundamentals (was Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2))

From: Ilya Zakharevich
Date: February 21, 2001 08:23
Subject: Re: Perl-Unicode fundamentals (was Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2))
Message ID: 20010221112254.A17436@math.ohio-state.edu

On Wed, Feb 21, 2001 at 09:17:58AM +0000, Nick Ing-Simmons wrote:
> >What is A?  
> 
> 'A' is whatever script reading process and toke.c think it is.

So it is 0xC1...

> #!perl
> exit(  ord('A') == 0xC1 ? 0 : 1 )
> __END__
> 
> must exit 0 on EBCDIC.

Of course.

> But on EBCDIC 
> 
>   print FOO "\xC1";
>   $a = <FOO>;
>   die unless lc($a) eq 'a';
> 
> mustn't die, etc. etc. 

... but I see no "but", just "of course".  Your 'a' is read from disk
as 0xE1 (or whatever - do not ), the cultural info table of EBCDIC says
that the lc variant of 0xC1 should be 0xE1 etc etc etc.  So it "just
works", the same as things work in any locale.
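The round trip discussed above can be imitated on an ASCII-based host. This is only a sketch: `lc_cp1047` is a hypothetical helper, the Encode module stands in for the platform's own EBCDIC tables, and an in-memory buffer replaces the FOO handle. Note that in cp1047 the byte for 'a' is 0x81.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: lower-case a string of cp1047 bytes. Unicode case
# rules stand in for the EBCDIC cultural info table of a real EBCDIC perl.
sub lc_cp1047 {
    my ($bytes) = @_;
    my $chars = decode('cp1047', $bytes);
    utf8::upgrade($chars);            # make sure lc() uses character semantics
    return encode('cp1047', lc $chars);
}

# Write the byte 0xC1 to an in-memory "file" and read it back.
my $buffer = '';
open my $out, '>', \$buffer or die "cannot open buffer: $!";
binmode $out;
print {$out} "\xC1";
close $out;

open my $in, '<', \$buffer or die "cannot open buffer: $!";
binmode $in;
my $data = <$in>;
close $in;

my $lower = lc_cp1047($data);
printf "read 0x%02X, lower-cased to 0x%02X\n", ord $data, ord $lower;

# The same check as in the quoted snippet, expressed in cp1047 bytes.
die "lc mismatch" unless $lower eq encode('cp1047', 'a');
```

On an ASCII host this should print `read 0xC1, lower-cased to 0x81` and not die.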

> It would have been possible to transform 0xC1 on disc to U+0041 as 
> seen by toke.c (e.g. with an implicit :encoding(cp1047) on DATA handle)
> but then the above requirements (to make old scripts work) would 
> be very messy. So they don't do that, toke.c sees '\xC1', the internal 
> "byte" form has numbers 0 .. 255 having their EBCDIC "cultural info"
> and so on. 

Exactly.  This is *why* I made my proposal: to support 'use locale',
do not translate things to Unicode.  "Translate" the cultural info
table instead.
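One way to picture "translating the table" rather than the data is to build a 256-entry lower-casing table once per character set and then apply it to raw bytes. The sketch below is an illustration only, not how perl implements locales or EBCDIC: `make_lc_table` and `lc_bytes` are hypothetical names, and Encode's `cp1047` and `iso-8859-1` mappings stand in for real cultural info tables.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: derive a byte -> byte lower-casing table for a
# single-byte encoding by pushing all 256 bytes through Unicode once.
sub make_lc_table {
    my ($encoding) = @_;
    my $all_bytes = join '', map { chr } 0 .. 255;
    my $chars = decode($encoding, $all_bytes);
    utf8::upgrade($chars);                    # force character semantics for lc()
    my $lowered = encode($encoding, lc $chars);
    die "table for $encoding is not 1:1" unless length($lowered) == 256;
    return [ map { ord } split //, $lowered ];
}

# Hypothetical helper: lower-case a byte string using only the table;
# the data itself is never converted to Unicode.
sub lc_bytes {
    my ($table, $bytes) = @_;
    $bytes =~ s/(.)/chr $table->[ord $1]/gse;
    return $bytes;
}

my $ebcdic = make_lc_table('cp1047');
my $latin1 = make_lc_table('iso-8859-1');

# The same byte gets a different lower-case partner under each table.
printf "cp1047:     lc(0xC1) = 0x%02X\n", ord lc_bytes($ebcdic, "\xC1");
printf "iso-8859-1: lc(0xC1) = 0x%02X\n", ord lc_bytes($latin1, "\xC1");
```

Under these assumptions the first line reports 0x81 (EBCDIC 'a') and the second 0xE1 (Latin-1 a-acute); only the table differs, the string operation is identical.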

> Our locale story is nowhere near as good as our Unicode story.
> But that is mostly the fault of under-specified locale semantics 
> at system level.

No, the faults are at different places:

  a) use locale is lexically scoped, so useless when modules are used;

  b) there was no defined semantics for the interaction of locale and
     Unicode [my proposal creates such a semantics];

> Switching on EBCDIC-ness is cleaner.

There is no difference (as far as Perl is concerned; except for
sorting) between EBCDIC-ness and locale.  If you feel otherwise,
please give an example to unconfuse me.
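The sorting exception is easy to see by comparing byte order. A small sketch, assuming an ASCII-based host and using Encode's `cp1047` mapping to imitate the byte values an EBCDIC perl would compare:

```perl
use strict;
use warnings;
use Encode qw(encode);

my @items = ('a', 'A', '1');

# Native comparison on an ASCII host: digits, then upper case, then lower case.
my @ascii_order = sort @items;

# Compare the cp1047 byte values instead, as a plain sort on EBCDIC would.
my @ebcdic_order =
    sort { encode('cp1047', $a) cmp encode('cp1047', $b) } @items;

print "ASCII byte order:  @ascii_order\n";
print "cp1047 byte order: @ebcdic_order\n";
```

In cp1047 the bytes are 0x81 for 'a', 0xC1 for 'A' and 0xF1 for '1', so the second list comes out as `a A 1`, the reverse of the ASCII order `1 A a`.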

> use utf8;
> 
> still has the semantics that the script itself is assumed to come
> from a UTF-8 encoded source file.

use utf8 is a mastodon.  It is not needed for any other purpose, so
let it be so.
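For reference, the remaining effect of the pragma can be shown in a few lines. This sketch assumes the script file itself is saved as UTF-8; the literal is an arbitrary example.

```perl
use strict;
use warnings;

# With the pragma, the literal is read as one character (U+00E9).
my $chars = do { use utf8; length("é") };

# Without it, the same source text is read as its two UTF-8 bytes.
my $bytes = do { no utf8; length("é") };

print "with use utf8: $chars, without: $bytes\n";
```

Given a UTF-8 source file, this prints `with use utf8: 1, without: 2`.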

> big5 has other problems in that it is a multi-byte encoding

Does not matter: I discuss character mapping here, not encoding.

Ilya
