Ilya Zakharevich <ilya@math.ohio-state.edu> writes:
>
>> The big question mark is what we (well "they" actually) do on EBCDIC
>> platforms where it has been demonstrated that ord('A') == 0xC1 is
>> a requirement (if only because it is used as a test for "this is an EBCDIC
>> platform").
>
>I have not the slightest idea what you are talking about.

Don't worry about it - unless you need perl on a native EBCDIC machine
it is a don't-care.

>What is A?

'A' is whatever the script-reading process and toke.c think it is.
I meant what I said:

  #!perl
  exit( ord('A') == 0xC1 ? 0 : 1 )
  __END__

must exit 0 on EBCDIC.

>You
>mean the byte 0xC1 on disk which happens to belong to a file-system
>representation of a Perl script?

Unless things get translated on the way in, yes.

>Of course if I do
>
>  print FOO "\xC1";
>  $a = <FOO>;
>
>then ord($a) should be 0xC1. The DATA handle is not any way more
>special than FOO.

I agree there. But on EBCDIC

  print FOO "\xC1";
  $a = <FOO>;
  die unless lc($a) eq 'a';

mustn't die, etc. etc.

>
>I think the real problem with understanding of how EBCDIC maps to
>other Perl concepts is in thinking that Perl strings have something
>else than "numbers with attached cultural info". For Perl, there is
>no notion of character 'A'. All Perl knows is how to case-convert
>"numbers", which "numbers" match \w, \d etc, which strings constitute
>keywords (sorting is a little bit more complicated).

But at the script level the 3-character sequence 'A' does have a meaning.
It would have been possible to transform 0xC1 on disk to U+0041 as seen
by toke.c (e.g. with an implicit :encoding(cp1047) on the DATA handle),
but then the above requirements (to make old scripts work) would be very
messy. So they don't do that: toke.c sees '\xC1', the internal "byte"
form has the numbers 0 .. 255 carrying their EBCDIC "cultural info",
and so on.

>
>This info can be switched in two ways: by 'use locale', and by being
>on EBCDIC.

Our locale story is nowhere near as good as our Unicode story.
But that is mostly the fault of under-specified locale semantics at
the system level. Switching on EBCDIC-ness is cleaner.

>Maybe in the future one can switch it also by 'use big5'
>(as opposed to the default 'use unicode').

In some sense the default is 'use iso8859_1', in that until told otherwise
perl assumes that raw bytes are U+0000..U+00FF, but I see what you mean.
As far as I am aware

  use utf8;

still has the semantics that the script itself is assumed to come
from a UTF-8 encoded source file.

big5 has other problems in that it is a multi-byte encoding - and you
cannot reversibly translate it to Unicode and back - but we don't need to
worry about that yet.

>
>> Everything is supposed to be "transparent", we have the module,
>> the masochists have their 'use bytes', let us just fix the bugs and docs
>> and release it.
>
>What remains is to convince Jarkko that we already are 99,9% there;
>and make sure that making 'use bytes' work *is not our target*.
>
>If it works as people expect, it is OK. If it does not, tough luck.
>It is not documented how it works anyway. If some change we *need* to
>make things transparent breaks some operation of 'use bytes', off this
>operation goes...

--
Nick Ing-Simmons <nik@tiuk.ti.com>
Via, but not speaking for: Texas Instruments Ltd.
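
For readers on ASCII machines, the difference between the byte 0xC1 on disk and the character 'A' that the script sees can be sketched with the Encode module. This only illustrates what an explicit cp1047 translation would do. It is not how the EBCDIC port works internally, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The platform test from the message: true only on a native EBCDIC perl.
my $is_ebcdic = ord('A') == 0xC1 ? 1 : 0;

# What an explicit translation layer would do (illustrative only):
# the character 'A' becomes the single byte 0xC1 in cp1047 ...
my $disk_byte = encode('cp1047', 'A');

# ... and the byte 0xC1 read back through the same encoding becomes 'A'.
my $char = decode('cp1047', "\xC1");

printf "is_ebcdic=%d byte=0x%02X\n", $is_ebcdic, ord($disk_byte);
print "decoded 0xC1 is 'A'\n" if $char eq 'A';
```

On a native EBCDIC perl no such translation takes place, which is why ord('A') == 0xC1 holds there directly.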