perl.perl5.porters | Postings from February 2001

Re: Perl-Unicode fundamentals (was Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2))

From: Ilya Zakharevich
Date: February 20, 2001 15:38
Subject: Re: Perl-Unicode fundamentals (was Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2))
Message ID: 20010220183754.C13792@math.ohio-state.edu

On Tue, Feb 20, 2001 at 09:53:09PM +0000, nick@ing-simmons.net wrote:
> I don't think that is what Jarkko meant at all. 
> Transparency may no longer be a goal - because we already have it!
> (well almost... a bug or two remains).

Well, then he should express himself better...

> >The choice of internal representation may matter performance-wise, but
> >this may be addressed with pragmas.  [BTW, if I correctly understood
> >what Jarkko was insinuating, we have no choice now: if a string
> >contains only logical chars <256, it is *forced* to bytes...  I hope
> >I'm wrong...]
> 
> You are wrong. chop($str.chr(256)) leaves the result in UTF-8 form - 
> I used it earlier to show the unpack('C',$str) != ord($str) "bug".

Again, then Jarkko needs to express himself better.  He objected that

  my $empty = substr "\x{101}", 1;
  sub qu ($) { shift . $empty}

will achieve what the current qu"something" does.
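
A minimal sketch of that trick, to make the point concrete. The sample string and the variable names $plain and $quoted are mine, used only for illustration, and the sub body uses $_[0] instead of shift purely for readability. Whatever the internal flag turns out to be on a given perl, the logical contents (length, ord, eq) should not change:

```perl
use strict;
use warnings;

# An empty string carved out of a string that holds a character above 255.
my $empty = substr "\x{101}", 1;

# Stand-in for qu"...": append that empty string to the argument.
sub qu ($) { $_[0] . $empty }

my $plain  = "caf\xE9";      # all logical chars < 256 (illustrative value)
my $quoted = qu($plain);

# The logical view of the two strings is identical.
printf "length:   %d vs %d\n", length($plain), length($quoted);
printf "last ord: 0x%X vs 0x%X\n",
    ord(substr($plain, -1)), ord(substr($quoted, -1));
print  "equal:    ", ($plain eq $quoted ? "yes" : "no"), "\n";

# Only the internal representation may differ; inspect it, do not rely on it.
print  "internal UTF-8 flag: ", (utf8::is_utf8($quoted) ? "on" : "off"), "\n";
```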

> The big question mark is what we (well "they" actually) do on EBCDIC 
> platforms where it has been demonstrated that ord('A') == 0xC1 is 
> a requirement (if only because it is used as a test for "this is an EBCDIC 
> platform").

I have not the slightest idea what you are talking about.  What is A?  Do
you mean the byte 0xC1 on disk which happens to belong to a file-system
representation of a Perl script?  Of course, if I do

  print FOO "\xC1";
  $a = <FOO>;

then ord($a) should be 0xC1.  The DATA handle is not in any way more
special than FOO.
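
Spelled out as something runnable, the two lines above might look like the sketch below. The temporary file, the lexical handles and the binmode calls are my additions (assumptions, so that no I/O layer touches the byte); the point is only that a byte written as 0xC1 reads back as the number 0xC1:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the single byte 0xC1 to a scratch file, with no translation layers.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xC1";
close $out or die "close: $!";

# Read it back the same way.
open my $in, '<', $path or die "open: $!";
binmode $in;
my $data = <$in>;
close $in;

# The number that went out is the number that comes back.
printf "ord = 0x%X\n", ord($data);
```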

I think the real problem with understanding how EBCDIC maps to
other Perl concepts is in thinking that Perl strings are anything
other than "numbers with attached cultural info".  For Perl, there is
no notion of character 'A'.  All Perl knows is how to case-convert
"numbers", which "numbers" match \w, \d etc., and which strings constitute
keywords (sorting is a little bit more complicated).
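
To illustrate "numbers with attached cultural info", here is a small sketch. The helper name describe() and the sample numbers are hypothetical, chosen by me; the values printed depend on which "cultural info" is in force (an ASCII-based platform with no 'use locale' is what I assume, and EBCDIC would give different answers for the same numbers):

```perl
use strict;
use warnings;

# Hypothetical helper: report what Perl "knows" about the number $n.
sub describe {
    my ($n) = @_;
    my $s = chr $n;
    return {
        upper    => ord(uc $s),              # case-converted "number"
        lower    => ord(lc $s),
        is_word  => ($s =~ /\w/ ? 1 : 0),    # does this "number" match \w?
        is_digit => ($s =~ /\d/ ? 1 : 0),    # does it match \d?
    };
}

for my $n (0x35, 0x41, 0x61, 0x20) {
    my $d = describe($n);
    printf "0x%02X: uc=0x%02X lc=0x%02X \\w=%d \\d=%d\n",
        $n, @{$d}{qw(upper lower is_word is_digit)};
}
```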

This info can be switched in two ways: by 'use locale', and by being
on EBCDIC.  Maybe in the future one can switch it also by 'use big5'
(as opposed to the default 'use unicode').

> Everything is supposed to be "transparent", we have the module, 
> the masochists have their 'use bytes', let us just fix the bugs and docs
> and release it. 

What remains is to convince Jarkko that we are already 99.9% there,
and to make sure that making 'use bytes' work *is not our target*.

If it works as people expect, it is OK.  If it does not, tough luck.
It is not documented how it works anyway.  If some change we *need* to
make things transparent breaks some operation of 'use bytes', off this
operation goes...

Ilya


