Re: Future Perl development

From:
Gerard Goossen
Date:
February 6, 2007 16:52
Subject:
Re: Future Perl development
Message ID:
20070207005609.GA1868@ostwald
On Tue, Feb 06, 2007 at 01:40:43PM -0500, mark@mark.mielke.cc wrote:
> On Tue, Feb 06, 2007 at 06:27:59PM +0100, Gerard Goossen wrote:
> > > This is not a matter of context, by the way. Instead, the value "\xFF"
> > > is polymorphic. It's both a unicode string representing code point
> > > U+00FF, and the single byte 0xFF.
> > No. \xFF creates a character represented by FF according to the native
> > encoding.
> > If your native encoding is EBCDIC this does NOT correspond to
> > U+00FF (instead it corresponds to U+007E or U+009F, depending on the
> > flavor of EBCDIC you're on).
> > You assume (like everybody else) that in the native encoding a
> > character corresponds to a byte with the same numeric value.
> > This assumption is what makes the transition to UTF-8 so difficult,
> > because in the UTF-8 encoding, the assumption is NOT correct. 
> 
> I think you are saying that UTF-EBCDIC should be the internal representation
> for strings in Perl on EBCDIC platforms if any character in the string
> has a value >= 0x80.

I would suggest making UTF-EBCDIC the representation in Perl7 on
EBCDIC platforms, regardless of what is in the string.

> If this is what you are saying, then I can see why I, and other people
> cannot understand you. We're not on the same page. I don't believe
> UTF-EBCDIC makes sense, as UTF-EBCDIC is not an encoding of UNICODE.
> It is an encoding of a mix between EBCDIC/UNICODE. Although UTF-8
> is only an encoding scheme, most people assume that the internal
> representation for a language that claims to support UNICODE, should
> be UNICODE, therefore the UTF-8 should be encoding UNICODE code
> points. Not EBCDIC/UNICODE code points.
> 
> Perhaps this would represent a performance degradation for systems
> that use EBCDIC natively? Is this why you would focus on UTF-EBCDIC?

UTF-EBCDIC is an encoding of UNICODE, but a strange one in the sense
that the bytes do NOT equal the codepoints, even for codepoints < 0x7F.
The bytes do, however, correspond to the EBCDIC bytes.
If you want codepoint U+0041 (ASCII 'A'), it would be UTF-EBCDIC
encoded as 0xC1. In the EBCDIC encoding, 0xC1 is also an 'A'.
So although the codepoints are not the same in UNICODE and EBCDIC,
with UTF-EBCDIC the bytes are.
Like you, my initial idea was also to use UTF-8 on EBCDIC
platforms, but SADAHIRO pointed out that on an EBCDIC platform '\n' in C
would generate a LF in UTF-EBCDIC, but not a LF in UTF-8.
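
The 'A' example above can be checked on an ASCII machine. This is only a
rough sketch: it uses the cp1047 code page from the Encode module as a
stand-in for "EBCDIC" (the choice of cp1047 and the helper name hex_bytes
are my assumptions, not anything Perl does internally), and it shows the
plain EBCDIC bytes, not UTF-EBCDIC itself.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode $text with $encoding and return the
# resulting bytes as space-separated hex.
sub hex_bytes {
    my ($encoding, $text) = @_;
    my $bytes = encode($encoding, $text);
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# U+0041 'A' in an EBCDIC code page versus UTF-8.
print "cp1047: ", hex_bytes('cp1047', 'A'), "\n";   # C1
print "UTF-8:  ", hex_bytes('UTF-8',  'A'), "\n";   # 41
```

The same codepoint ends up as a different byte, which is why C code compiled
on an EBCDIC platform would not agree with UTF-8 encoded strings.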

> Anyways - I've not shared people's opinions that Perl's implementation
> of UNICODE or UTF-8 is excellent. I've avoided it wherever possible.
> I prefer Java's approach or GTK's approach. Java uses UTF-16 internal
> representation, but never confuses internal representation with
> external representation. If portability is of concern, this seems
> an excellent approach.

Having a strict separation is certainly a valid approach, but I think it is
much more in the style of Perl to have a transparent conversion from one
type of scalar to another, be it number, text-string or byte-string.
This is also how it used to work when there was only latin1, where
the byte representation would be identical to the numeric value of the
character, leading to people using \xFF to create the byte 0xFF.
Back to why I think a transparent conversion would be convenient,
for example when doing:

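use strict;
use warnings;
use Socket;
use IO::Handle;

# Send $string through a socket pair and return what arrives at the other end.
sub identity {
  my $string = shift;
  my ($sock1, $sock2);
  socketpair($sock1, $sock2, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
    or die "socketpair: $!";
  $sock1->print($string);
  $sock1->close;      # flush and signal EOF to the reader
  local $/ = undef;   # slurp mode
  return $sock2->getline();
}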

I don't care whether $string is a text-string or byte-string, I just want
it to return the same string.
The conversion should not be a problem as long as there is one encoding
which is always used (of course the text-string must be able to be
encoded using this encoding).
The problem with the current Perl 5 is that it uses two encodings, latin1
and UTF-8. The above identity holds in Perl 5, for both text and byte strings,
as long as you don't use any unicode characters (codepoints above 0xFF),
leading to people avoiding unicode :-(
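
To make the point concrete, here is a small sketch of the identity test
with and without a single fixed encoding. The optional $layer argument and
the test strings are my additions for illustration only; the "Wide character"
warning is silenced so the output stays readable.

```perl
use strict;
use warnings;
use Socket;

# Variant of identity() with an optional I/O layer (hypothetical parameter)
# applied to both ends of the socket pair.
sub identity {
    my ($string, $layer) = @_;
    socketpair(my $in, my $out, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
        or die "socketpair: $!";
    if (defined $layer) {
        for my $fh ($in, $out) {
            binmode($fh, $layer) or die "binmode: $!";
        }
    }
    {
        no warnings 'utf8';   # silence "Wide character in print"
        print {$in} $string;
    }
    close $in or die "close: $!";
    local $/ = undef;         # slurp mode
    return scalar readline $out;
}

my $latin1 = "caf\xE9";       # all codepoints <= 0xFF
my $wide   = "\x{263A}";      # codepoint above 0xFF

print "latin1, no layer: ", (identity($latin1) eq $latin1 ? "same" : "different"), "\n";
print "wide,   no layer: ", (identity($wide)   eq $wide   ? "same" : "different"), "\n";
print "wide,   UTF-8:    ",
    (identity($wide, ':encoding(UTF-8)') eq $wide ? "same" : "different"), "\n";
```

Without a layer the latin1 string survives and the wide one comes back as
UTF-8 bytes; with the same encoding on both ends, the identity holds again.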


Gerard Goossen



