perl.perl5.porters | Postings from February 2007

Re: Future Perl development

From: Juerd Waalboer
Date: February 5, 2007 13:41
Subject: Re: Future Perl development
Message ID: 20070205214115.GB25362@c4.convolution.nl

Gerard Goossen wrote 2007-02-05 20:39 (+0100):
> Sometimes you need have a byte-string.

Indeed.

> But \x.. generates a character.

(Note that \xFF and \x{ff} are the same, for any capitalization of ff.)

Or a byte. Because of the clever Unicode implementation in Perl, you get
a character if you use the return value in a unicode string, and a byte
if you use the return value in a byte string.
    
This is not a matter of context, by the way. Instead, the value "\xFF"
is polymorphic. It's both a unicode string representing code point
U+00FF, and the single byte 0xFF.

If you start using the value in a unicode string, it is always codepoint
U+00FF. Depending on the internal encoding of the unicode string that
you mixed it with, it either stays the same (latin1), or is upgraded
(utf8). After this upgrade (or not), the codepoint is still U+00FF.

If you start using the value in a byte string, it is always byte 0xFF.
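A minimal sketch of that polymorphism, assuming a stock Perl 5.8 or later; the variable names are only for illustration. The same one-element value ends up as code point U+00FF next to a wide character, and as a plain byte next to other bytes:

```perl
use strict;
use warnings;

my $value = "\xFF";    # one element with ordinal 0xFF

# Joined with a character above U+00FF, it is code point U+00FF.
my $text      = $value . "\x{2740}";
my @text_ords = map { ord($_) } split //, $text;

# Joined with other bytes, it is the single byte 0xFF.
my $bytes      = $value . "\x00\x01";
my $byte_count = length $bytes;

printf "text:  %s\n", join " ", map { sprintf "U+%04X", $_ } @text_ords;
printf "bytes: %d\n", $byte_count;
```

Nothing in the code says which kind of string is meant; that follows only from what the value is combined with.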

Because latin1 perfectly overlaps with the old de facto charset, and
is also a perfect subset of Unicode, there is no need to differentiate
between unicode strings and byte strings internally. This makes it fully
backwards compatible with existing or naive code that doesn't deal with
character encodings, as long as you don't mix it with the new stuff.

Yes, this does mean that the programmer should be more careful.

> In Perl 5 \xFF generates a byte. But if your target encoding is UTF-8,
> \xFF generates two bytes.

There isn't really such a thing as a "target encoding". Perl only
internally keeps track of encodings. You can't specifically make $foo a
windows-1252 string.

There are unicode strings and byte strings. You can't tell which
scalar contains which kind, but if you properly keep them separated,
there is no need for that.

It makes no sense, whatsoever, to mix unicode strings and byte strings.
I've explained that before, but will continue doing so until it sticks:

    When you need to extract text from a byte string, it needs to be
    I<decoded> in some way. You MUST know the encoding in order to
    decode it properly. For decoding, one uses the C<decode> function:

        use Encode qw(decode);
        my $unicode_string = decode("iso-8859-3", $byte_string);

    When you need to use text in a byte string, it needs to be I<encoded>
    in some way. You MUST know the encoding in order to encode it
    properly. For encoding, one uses the C<encode> function:

        use Encode qw(encode);
        my $byte_string = encode("CP850", $unicode_string);

    To convert a foo-encoded byte string to a bar-encoded byte string,
    use a temporary unicode string:

        use Encode qw(decode encode);
        my $unicode_string = decode("foo", $source_byte_string);
        my $target_byte_string = encode("bar", $unicode_string);

    or use the shortcut function C<from_to>, which converts the string
    in place (it returns the new length in bytes, not the string):

        use Encode qw(from_to);
        from_to($byte_string, "foo", "bar");  # $byte_string is now bar-encoded
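To make the placeholders concrete, here is a runnable sketch of the same round trip. The choice of iso-8859-1 as "foo" and UTF-8 as "bar", and the sample word, are assumptions made for the example only. Note that C<from_to> modifies its first argument in place, so the sketch works on a copy:

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to);

# "caf" followed by e-acute, as iso-8859-1 bytes (sample data)
my $latin1_bytes = "caf\xE9";

# Bytes in, text out: four characters, the last one is U+00E9.
my $text = decode("iso-8859-1", $latin1_bytes);

# Text in, bytes out: U+00E9 becomes the two bytes C3 A9 in UTF-8.
my $utf8_bytes = encode("UTF-8", $text);

# The shortcut: convert a copy of the source bytes in place.
my $converted = $latin1_bytes;
from_to($converted, "iso-8859-1", "UTF-8");

printf "text length:  %d\n", length $text;
printf "utf8 length:  %d\n", length $utf8_bytes;
printf "same result:  %s\n", $converted eq $utf8_bytes ? "yes" : "no";
```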

Note that unicode strings don't have an encoding. Encodings are byte
stuff, and unicode strings do characters, not bytes. Sure, internally,
they need some encoding: everything is zeroes and ones inside your
computer. You don't need to know the internal encodings.
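If you want to convince yourself, here is a small sketch using the core C<utf8::upgrade> and C<utf8::is_utf8> functions. It assumes the literal starts out in the latin1 representation, which is what current perls do for a literal like this, but is not something you should rely on. The internal representation changes; the string, as far as Perl code can see, does not:

```perl
use strict;
use warnings;

my $original = "L'Ha\xFF-les-Roses";
my $upgraded = $original;
utf8::upgrade($upgraded);    # switch the internal representation to utf8

# Normalize the internal flag to 1 or 0 so it can be compared and printed.
my $flag_before = utf8::is_utf8($original) ? 1 : 0;
my $flag_after  = utf8::is_utf8($upgraded) ? 1 : 0;

my $same_string = $original eq $upgraded             ? 1 : 0;
my $same_length = length($original) == length($upgraded) ? 1 : 0;

print "internal flag: $flag_before -> $flag_after\n";
print "same string:   $same_string\n";
print "same length:   $same_length\n";
```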

> And there is no way to insert the byte FF into the string, because
> this isn't a valid codepoint UTF-8.

I'll assume that you mean that the single byte 0xFF is invalid in utf8.

That is no problem, because unicode strings aren't utf8 strings in
I<real> Perl. Unicode strings (also called "text strings") are
I<unicode> strings. And "unicode" is not synonymous with "utf8" in any
way. iso-8859-1 is a great unicode encoding, if you only need code
points up to U+00FF.

Because 0xFF as a byte doesn't make sense in a utf8 string, it is
automatically utf8-encoded whenever you use it in a unicode string that
is encoded as utf8 internally. This is behind-the-scenes stuff that you
need not worry about.

    my $foo = "L'Ha\xFF-les-Roses";  # a place in France
    my $bar = "Welcome to \x{2740} $foo \x{2740}";

Here, $foo and $bar are obviously text strings. Another name for "text
string" is "unicode string", because Perl does unicode for all text
strings.

As for the internal encodings, why care about them? Perl regulates this
for you, and does so quite well!

But I'll explain it anyway.

$foo is probably encoded as iso-8859-1 (latin1) internally, because that
fits. It may be encoded as utf8 too. 

$bar is encoded as utf8 internally, because the flowers don't fit in
latin1. $foo, interpolated in $bar, will match that encoding. This means
that if it wasn't utf8 already, it is automatically upgraded from latin1
to utf8.

As you can see, there is no need to worry about 0xFF being an invalid
utf8 sequence here! 
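A sketch that checks this with the two strings from above. The use of C<index> to find the character and of C<encode> to produce output bytes are my additions for the demonstration. Inside $bar the character is still code point U+00FF; bytes only appear when you explicitly encode:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $foo = "L'Ha\xFF-les-Roses";  # a place in France
my $bar = "Welcome to \x{2740} $foo \x{2740}";

# Find the character inside $bar and look at its code point.
my $pos        = index($bar, "\xFF");
my $code_point = ord substr($bar, $pos, 1);

# Only at the boundary to the outside world do bytes come into play.
my $octets = encode("UTF-8", $bar);

printf "code point in \$bar: U+%04X\n", $code_point;
printf "characters: %d, UTF-8 bytes: %d\n", length $bar, length $octets;
```

The byte count is larger than the character count because U+00FF and the two flowers each take more than one byte in UTF-8.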

> So I proposed to use \x[FF] in Perl7 to insert the byte FF. 

For inserting the byte 0xFF into byte strings, there is already \xFF.

For inserting the byte 0xFF into unicode strings, there is nothing,
fortunately. Having something that messes with (and destroys) the
internal coherence would be as bad as the \C escape in regular
expressions, but more destructive.

In I<real> Perl 5, your proposal makes no sense. 

In I<your> Perl 5, which is heavily patched, it may make sense. I don't
know.

> In Perl 5 \xFF inserts a byte, because 0xFF is smaller then 256, but
> having \x[FF] to be explicit that you want a byte would be nice.

For the aforementioned reason, I think that it would not be nice at all.
-- 
kind regards,

  juerd waalboer:  perl hacker  <juerd@juerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <sales@convolution.nl>

I don't trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.


