develooper Front page | perl.perl5.porters | Postings from February 2007

Re: Future Perl development

February 6, 2007 19:03
On Wed, Feb 07, 2007 at 01:56:09AM +0100, Gerard Goossen wrote:
> I would suggest making UTF-EBCDIC the representation in Perl7 on
> EBCDIC platforms, regardless of what is in the string.

Why? Only performance? Why is UTF-EBCDIC not frequently used any longer,
and why should Perl buck that trend?

> UTF-EBCDIC is an encoding of UNICODE, but a strange one in the sense
> that bytes do NOT correspond to codepoints for codepoints < 0x7F. But
> the bytes do correspond to EBCDIC.
> If you want to have codepoint U+0041 (ASCII 'A'), this would be UTF-EBCDIC
> encoded as 0xC1. Using EBCDIC encoding, 0xC1 would also be an 'A'.
> So although the codepoints are not the same with UNICODE and
> EBCDIC, using UTF-EBCDIC the bytes are.

What remains is that EBCDIC, or UTF-EBCDIC, is not UNICODE. A second
translation phase is required to convert EBCDIC or UTF-EBCDIC to UNICODE.
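
To make the translation step concrete, here is a rough sketch using the
core Encode module. It assumes the cp1047 EBCDIC code page as a stand-in,
since Encode does not ship a UTF-EBCDIC encoding as far as I know; for a
character like 'A' the byte is the same 0xC1 mentioned above. The helper
names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: the EBCDIC (cp1047) byte value for one character.
sub ebcdic_byte_for {
    my ($char) = @_;
    my $bytes = encode('cp1047', $char);   # Unicode text -> EBCDIC bytes
    return ord($bytes);
}

# Hypothetical helper: the Unicode codepoint for one EBCDIC (cp1047) byte.
sub codepoint_for_ebcdic_byte {
    my ($byte) = @_;
    my $text = decode('cp1047', chr($byte));   # EBCDIC bytes -> Unicode text
    return ord($text);
}

printf "U+%04X is stored as byte 0x%02X\n", 0x41, ebcdic_byte_for("A");
printf "byte 0x%02X decodes to U+%04X\n", 0xC1, codepoint_for_ebcdic_byte(0xC1);
```

The byte value and the codepoint differ, so anything that wants codepoints
has to go through a decode step like the second helper.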

> Like you, my initial idea was also to use UTF-8 on EBCDIC
> platforms, but SADAHIRO pointed out that on an EBCDIC platform '\n' in C
> would generate a LF in UTF-EBCDIC, but not a LF in UTF-8.

The '\n' vs '\r' question is a large issue on its own that has little to
do with UNICODE. I believe Mac Perl has actually swapped '\n' and '\r'.

I don't believe that the native end-of-line character is a reason to
choose UTF-8 vs UTF-EBCDIC.

> sub identity {
>   my $string = shift;
>   my ($sock1, $sock2);
>   socketpair($sock1, $sock2, AF_UNIX, SOCK_STREAM, PF_UNSPEC) or die;
>   $sock1->print($string);
>   $sock1->close;
>   local $/ = undef; # slurp mode
>   return $sock2->getline();
> }
> I don't care whether $string is a text-string or byte-string, I just want
> it to return the same string.

Perhaps you should care. In a language such as Java, you are forced to
care, as byte[] and String are different types. Perl blurs this difference,
and lets you believe that you should not need to care.
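
As an illustration of what caring might look like, here is a sketch of the
same round trip with the byte/text boundary made explicit. It is only one
possible arrangement, not a claim about how Perl should do it: the
function names are invented, and the choice of UTF-8 on the wire is an
assumption made for the example.

```perl
use strict;
use warnings;
use Socket;
use Encode qw(encode decode);

# Byte-string round trip: what goes in is exactly what comes out.
# Only suitable for payloads small enough to fit in the socket buffer.
sub identity_bytes {
    my ($bytes) = @_;
    socketpair(my $sock1, my $sock2, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
        or die "socketpair failed: $!";
    binmode($_) for $sock1, $sock2;        # raw bytes, no I/O layers
    print {$sock1} $bytes;
    close($sock1);
    local $/ = undef;                      # slurp mode
    my $out = <$sock2>;
    close($sock2);
    return $out;
}

# Text-string round trip: encode before the socket, decode after it.
sub identity_text {
    my ($text) = @_;
    my $wire = encode('UTF-8', $text);     # text -> bytes (assumed wire format)
    return decode('UTF-8', identity_bytes($wire));   # bytes -> text
}
```

The caller has to pick identity_bytes or identity_text, which is the same
decision Java forces with byte[] versus String.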

> The problem with the current Perl 5 is that it uses two encodings, latin1
> and UTF-8. The above identity holds in Perl 5, for both text and byte strings,
> as long as you don't use any unicode characters, leading to people
> avoiding unicode :-(

This is more of the same confusion with Perl's implementation.

It scares me too.
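
For anyone who has not run into the two internal forms, a small sketch
follows. It uses utf8::upgrade and utf8::is_utf8, which exist in Perl 5.8
and later; the function name is invented, and the flag it inspects is an
internal detail that ordinary code should not depend on.

```perl
use strict;
use warnings;

# Hypothetical helper: build the same four characters in both internal forms
# and report what Perl says about them.
sub compare_internal_forms {
    my $native   = "caf\xE9";      # stored one byte per character (latin1)
    my $upgraded = $native;
    utf8::upgrade($upgraded);      # same characters, stored as UTF-8 internally

    return {
        native_flag   => utf8::is_utf8($native)   ? 1 : 0,
        upgraded_flag => utf8::is_utf8($upgraded) ? 1 : 0,
        equal         => ($native eq $upgraded)   ? 1 : 0,
        same_length   => (length($native) == length($upgraded)) ? 1 : 0,
    };
}

my $r = compare_internal_forms();
print "$_ = $r->{$_}\n" for sort keys %$r;
```

The two strings compare equal and have the same length, yet their internal
flags differ, and that hidden difference is where much of the confusion
comes from.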


-- / /     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...
