develooper Front page | perl.perl5.porters | Postings from February 2007

Re: Future Perl development

From: mark
Date: February 6, 2007 19:03
Subject: Re: Future Perl development
Message ID: 20070207030257.GA24948@mark.mielke.cc

On Wed, Feb 07, 2007 at 01:56:09AM +0100, Gerard Goossen wrote:
> I would suggest making UTF-EBCDIC the representation in Perl7 on
> EBCDIC platforms, regardless of what is in the string.

Why? Only performance? Why is UTF-EBCDIC not frequently used any longer,
and why should Perl buck that trend?

> UTF-EBCDIC is an encoding of UNICODE, but a strange one in the sense
> that the bytes do NOT equal the codepoints for codepoints < 0x7F. The
> bytes do, however, correspond to the EBCDIC bytes.
> If you want to have codepoint U+0041 (ASCII 'A') this would be UTF-EBCDIC
> encoded as 0xC1. Using EBCDIC encoding 0xC1 would also be an 'A'.
> So although the codepoints are not the same in UNICODE and
> EBCDIC, using UTF-EBCDIC the bytes are.

What remains is that EBCDIC, or UTF-EBCDIC, is not UNICODE. A second
translation phase is required to convert EBCDIC/UTF-EBCDIC to UNICODE.
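
To make the byte values concrete, here is a minimal sketch using the
Encode module. The choice of cp1047 as the EBCDIC code page and the
hex_octets helper are assumptions for illustration, not something from
this thread, and the sketch encodes to plain single-byte EBCDIC rather
than UTF-EBCDIC (which, per the quoted text, gives the same byte for 'A').

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render a byte string as space-separated hex octets.
# (hex_octets is a hypothetical helper name.)
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $octets;
}

my $char = "\x{41}";    # U+0041, LATIN CAPITAL LETTER A

# Same code point, two different byte values.
printf "UTF-8:  %s\n", hex_octets(encode('UTF-8',  $char));   # 41
printf "cp1047: %s\n", hex_octets(encode('cp1047', $char));   # C1
```

Going from the 0xC1 byte back to U+0041 is the extra translation step
referred to above.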

> Like you, my initial idea was also to use UTF-8 on EBCDIC
> platforms, but SADAHIRO pointed out that on an EBCDIC platform '\n' in C
> would generate a LF in UTF-EBCDIC, but not a LF in UTF-8.

The '\n' vs '\r' is a large issue on its own that has little to do with
UNICODE. I believe Mac Perl has actually swapped '\n' and '\r'.

I don't believe that the native end-of-line character is a reason to
choose UTF-8 vs UTF-EBCDIC.
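
For reference, a small sketch of what the line feed looks like at the
byte level. It assumes an ASCII platform and again uses cp1047 as a
stand-in EBCDIC code page (my assumption); hex_octets is a hypothetical
helper. I deliberately leave the cp1047 value uncommented, since it
depends on the code page table in use.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render a byte string as space-separated hex octets.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $octets;
}

my $lf = "\x{0A}";    # U+000A, LINE FEED, named by code point

# "\n" itself is platform dependent in Perl, so print what it is here.
printf "ord(\"\\n\") on this platform: %d\n", ord("\n");

printf "LF as UTF-8:  %s\n", hex_octets(encode('UTF-8',  $lf));
printf "LF as cp1047: %s\n", hex_octets(encode('cp1047', $lf));
```

Naming the character by code point rather than as "\n" sidesteps the
question of what the native end-of-line character is.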

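> use Socket;      # AF_UNIX, SOCK_STREAM, PF_UNSPEC
> use IO::Handle;  # method calls on socket handles
>
> sub identity {
>   my $string = shift;
>   my ($sock1, $sock2);
>   socketpair($sock1, $sock2, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
>     or die "socketpair: $!";
>   $sock1->print($string);
>   $sock1->close;
>   local $/ = undef; # slurp mode
>   return $sock2->getline();
> }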
> 
> I don't care whether $string is a text-string or byte-string, I just want 
> it to return the same string.

Perhaps you should care. In a language such as Java, you are forced to
care, as byte[] and String are different types. Perl blurs this difference,
and lets you believe that you should not need to care.
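
A rough sketch of what "caring" could look like in Perl 5: keep the
socket strictly in bytes, and encode or decode text at the boundary.
The function names roundtrip_bytes and roundtrip_text are hypothetical,
UTF-8 as the wire encoding is my assumption, and socketpair with AF_UNIX
assumes a Unix-like platform.

```perl
use strict;
use warnings;
use Socket;
use Encode qw(encode decode);

# Send octets through a socket pair and read them back unchanged.
sub roundtrip_bytes {
    my ($octets) = @_;
    socketpair(my $writer, my $reader, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
        or die "socketpair: $!";
    binmode($_) for $writer, $reader;    # no encoding layers: raw bytes
    print {$writer} $octets;
    close $writer;                       # signal EOF to the reader
    local $/ = undef;                    # slurp mode
    my $data = <$reader>;
    close $reader;
    return $data;
}

# Text must be encoded before it hits the wire and decoded afterwards.
sub roundtrip_text {
    my ($text) = @_;
    my $octets = encode('UTF-8', $text);
    return decode('UTF-8', roundtrip_bytes($octets));
}

my $text = "caf\x{E9} \x{263A}";         # includes a code point > 0xFF
print "text survives\n"  if roundtrip_text($text) eq $text;
print "bytes survive\n"  if roundtrip_bytes("\x00\xFF") eq "\x00\xFF";
```

This is the byte[] versus String split done by hand: the caller picks
the function, so the type of the data is never left to guessing.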

> The problem with the current Perl 5 is that it uses two encodings, latin1
> and UTF-8. The above identity holds in Perl 5, for both text and byte strings,
> as long as you don't use any unicode characters, leading to people 
> avoiding unicode :-(

This is more of the same confusion with Perl's implementation.

It scares me too.
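
A small sketch of the two internal representations, for anyone who has
not seen them side by side. It assumes an ASCII platform and relies only
on the core utf8:: functions and bytes::length.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without enabling byte semantics

my $native   = "\xE9";    # e-acute, stored as one native (latin1) byte
my $upgraded = "\xE9";
utf8::upgrade($upgraded); # same character, now stored internally as UTF-8

# The two strings are equal as far as Perl code can tell...
print "equal\n" if $native eq $upgraded;

# ...but their internal storage differs.
printf "native:   is_utf8=%d, internal octets=%d\n",
    (utf8::is_utf8($native)   ? 1 : 0), bytes::length($native);
printf "upgraded: is_utf8=%d, internal octets=%d\n",
    (utf8::is_utf8($upgraded) ? 1 : 0), bytes::length($upgraded);
```

As long as every character fits in one byte, Perl can quietly move
between the two forms; once a code point above 0xFF is present, only
the UTF-8 form can hold the string.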

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

                           http://mark.mielke.cc/



