
Re: Perl-Unicode fundamentals (was Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2))

From: Nick Ing-Simmons
Date: February 21, 2001 01:19
Message ID: 200102210917.JAA26933@mikado.tiuk.ti.com

Ilya Zakharevich <ilya@math.ohio-state.edu> writes:
>
>> The big question mark is what we (well "they" actually) do on EBCDIC 
>> platforms where it has been demonstrated that ord('A') == 0xC1 is 
>> a requirement (if only because it is used as a test for "this is an EBCDIC 
>> platform").
>
>I haven't the slightest idea what you are talking about.  

Don't worry about it - unless you need perl on a native EBCDIC machine,
it is a "don't care".

>What is A?  

'A' is whatever the script-reading process and toke.c think it is.

I meant what I said:

#!perl
exit( ord('A') == 0xC1 ? 0 : 1 );
__END__

must exit 0 on EBCDIC.


>You
>mean the byte 0xC1 on disk which happens to belong to a file-system
>representation of a Perl script?  

Unless things get translated on the way in, yes.

>Of course if I do
>
>  print FOO "\xC1";
>  $a = <FOO>;
>
>then ord($a) should be 0xC1.  The DATA handle is not any way more
>special than FOO.

I agree there.
But on EBCDIC:

  print FOO "\xC1";
  $a = <FOO>;
  die unless lc($a) eq 'a';

mustn't die, etc. etc. 
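
For readers on ASCII machines, the same relationship can be sketched with
the Encode module. This is only an illustration: it assumes Encode's cp1047
table is a fair stand-in for the platform's EBCDIC, and it is not what an
EBCDIC perl does internally (there the byte 0xC1 simply *is* 'A').

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: Encode's "cp1047" table stands in for the platform's EBCDIC.
my $byte  = "\xC1";                        # what an EBCDIC file holds for 'A'
my $copy  = $byte;                         # decode a copy, keep the original
my $char  = decode('cp1047', $copy);       # Unicode view of that byte
my $lower = lc $char;                      # case mapping done in Unicode space

printf "U+%04X lc -> %s\n", ord($char), $lower;
```

On an ASCII platform this should report U+0041 and 'a', which is the
behaviour the EBCDIC snippet above has to reproduce natively.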

>
>I think the real problem with understanding of how EBCDIC maps to
>other Perl concepts is in thinking that Perl strings have something
>else than "numbers with attached cultural info".  For Perl, there is
>no notion of character 'A'.  All Perl knows is how to case-convert
>"numbers", which "numbers" match \w, \d etc, which strings constitute
>keywords (sorting is a little bit more complicated).

But at the script level the three-character sequence 'A' does have a meaning.
It would have been possible to transform 0xC1 on disc to U+0041 as 
seen by toke.c (e.g. with an implicit :encoding(cp1047) on the DATA handle),
but then meeting the above requirements (to make old scripts work) would 
be very messy. So they don't do that: toke.c sees '\xC1', the internal 
"byte" form has numbers 0 .. 255 carrying their EBCDIC "cultural info",
and so on. 
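
To make the two options concrete, here is a rough sketch runnable on an
ASCII perl. The in-memory file and the variable names are my own stand-ins
for "a script on an EBCDIC disc"; it shows the translate-on-input path that
was not taken next to the raw path that was.

```perl
use strict;
use warnings;

# Assumption: an in-memory file stands in for a script stored on an EBCDIC disc.
my $on_disc = "\xC1";

# Path not taken: translate on the way in, so the reader sees U+0041.
open my $translated, '<:encoding(cp1047)', \$on_disc or die "open: $!";
read $translated, my $seen_translated, 1;
close $translated;

# Path taken: raw bytes, so the reader sees 0xC1 unchanged.
open my $raw, '<:raw', \$on_disc or die "open: $!";
read $raw, my $seen_raw, 1;
close $raw;

printf "translated: 0x%02X, raw: 0x%02X\n",
    ord($seen_translated), ord($seen_raw);
```

With the translating layer, ord('A') == 0xC1 could no longer hold, which is
exactly the old-script breakage mentioned above.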

>
>This info can be switched in two ways: by 'use locale', and by being
>on EBCDIC.  

Our locale story is nowhere near as good as our Unicode story.
But that is mostly the fault of under-specified locale semantics 
at system level.

Switching on EBCDIC-ness is cleaner.

>Maybe in the future one can switch it also by 'use big5'
>(as opposed to the default 'use unicode').

In some sense the default is 'use iso8859_1' in that until told otherwise
perl assumes that raw bytes are U+0000..U+00FF, but I see what you mean.
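
A small sketch of what that default means in practice, on an ASCII
platform. The variable names are mine; utf8::upgrade is used only to force
the internal representation to change so the effect is visible.

```perl
use strict;
use warnings;

my $s      = "\xE9";          # one raw byte, no encoding declared
my $before = ord $s;          # the byte value, 0xE9

utf8::upgrade($s);            # switch the internal representation to UTF-8
my $after  = ord $s;          # still 0xE9: the byte was taken as U+00E9

my $same = $s eq "\xE9" ? 1 : 0;   # upgraded and raw forms compare equal
printf "before=0x%02X after=0x%02X same=%d\n", $before, $after, $same;
```

The string changes storage but not meaning, which is the sense in which the
default behaves like an implicit 'use iso8859_1'.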

As far as I am aware 

use utf8;

still has the semantics that the script itself is assumed to come
from a UTF-8 encoded source file.

big5 has other problems in that it is a multi-byte encoding - and 
you cannot reversibly translate it to Unicode and back - but we don't 
need to worry about that yet.

>
>> Everything is supposed to be "transparent", we have the module, 
>> the masochists have their 'use bytes', let us just fix the bugs and docs
>> and release it. 
>
>What remains is to convince Jarkko that we are already 99.9% there;
>and make sure that making 'use bytes' work *is not our target*.
>
>If it works as people expect, it is OK.  If it does not, tough luck.
>It is not documented how it works anyway.  If some change we *need* to
>make things transparent breaks some operation of 'use bytes', off this
>operation goes...

-- 
Nick Ing-Simmons <nik@tiuk.ti.com>
Via, but not speaking for: Texas Instruments Ltd.



