develooper Front page | perl.perl5.porters | Postings from February 2007

Re: Future Perl development

From: Gerard Goossen
Date: February 5, 2007 13:31
Subject: Re: Future Perl development
Message ID: 20070205213510.GE9642@ostwald

On Mon, Feb 05, 2007 at 02:56:09PM -0500, mark@mark.mielke.cc wrote:
> On Mon, Feb 05, 2007 at 08:39:50PM +0100, Gerard Goossen wrote:
> > Sometimes you need to have a byte string. But \x.. generates a character.
> > In Perl 5 \xFF generates a byte. But if your target encoding is UTF-8,
> > \xFF generates two bytes. And there is no way to insert the byte FF into
> > the string, because FF on its own isn't valid UTF-8. So I proposed to
> > use \x[FF] in Perl7 to insert the byte FF. In Perl 5 \xFF inserts a byte,
> > because 0xFF is smaller than 256, but having \x[FF] to be explicit that
> > you want a byte would be nice.
> 
> I think this becomes a confusion between UTF-8 strings and byte strings.
> 
> Why would you care about the representation in memory? Will the string
> be passed to a C function that expects bytes, and not UTF-8?

Yes, that is a perfect example.
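
To make the character/byte distinction concrete, here is a minimal sketch in
current Perl 5, assuming an ASCII platform. The \x[FF] syntax is only a
proposal, so pack("C", ...) stands in for it here; the variable names are
purely illustrative.

```perl
use strict;
use warnings;

# "\xFF" denotes the character with code point 0xFF.
my $char = "\xFF";

# Encode a copy as UTF-8: code point 0xFF becomes the two bytes C3 BF.
my $utf8 = $char;
utf8::encode($utf8);

# pack "C" produces a single octet with value 0xFF. This is only a
# stand-in for what the proposed \x[FF] would insert.
my $byte = pack("C", 0xFF);

printf "utf8: %d byte(s): %s\n", length($utf8), uc(unpack("H*", $utf8));
printf "raw:  %d byte(s): %s\n", length($byte), uc(unpack("H*", $byte));
```

A C function that expects the single byte FF needs the second form; handing
it the UTF-8 encoded string would give it C3 BF instead.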

> > PS. This would also solve some EBCDIC problems where in Perl5 \xA4 does not 
> > generate an 'A', on EBCDIC platforms.

Sorry, should have been \x41.

> I don't understand. If it needs to be translated from UTF-8 to EBCDIC when
> output to the screen, then that is where it should happen.

Sorry, I didn't explain how I think it should work on EBCDIC platforms:
For EBCDIC platforms the default encoding would be UTF-EBCDIC, and the
character set would be Unicode.
If you write \x41, that would insert the character with code point 0x41, which
according to Unicode is the character 'A'. Using the UTF-EBCDIC
encoding, code point 0x41 would have the byte representation C1.
So with this, on EBCDIC you would have: "\x41" eq "\x[C1]"
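
The mapping from code point 0x41 to byte C1 can be checked from an ASCII
platform with the Encode module. This is only a sketch: it assumes the cp1047
EBCDIC code page from Encode as a stand-in for UTF-EBCDIC, which is reasonable
for 'A' because both represent it as the single byte C1.

```perl
use strict;
use warnings;
use Encode qw(encode);

# On an ASCII platform, code point 0x41 is the character 'A'.
my $char = chr(0x41);

# Assumption: cp1047 stands in for UTF-EBCDIC here; for 'A' both
# encodings produce the same single byte.
my $ebcdic = encode("cp1047", $char);

printf "%s -> %s\n", $char, uc(unpack("H*", $ebcdic));   # A -> C1
```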


Gerard Goossen



