
Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

From: Juerd Waalboer
Date: March 30, 2007 14:44
Subject: Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)
Message ID: 20070330214412.GY31277@c4.convolution.nl

Tels wrote 2007-03-30 23:17 (+0000):
> > If it is so deadly to collide byte-oriented data with character data,
> > it should not be so easy to do so accidentally.
> It can happen every time you concatenate two strings. Maybe we could add a
> new warning?

Eh, no, because Perl does not have any metadata telling you whether a
non-UTF8 string is a latin1 text string or just a random byte string.

There is no way to tell Perl how you intended your string to be used,
and there is no way for Perl to tell you the same thing about a string
it returned.
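
A minimal sketch of what that missing metadata means in practice, using
only the core Encode module. The variable names and sample values are
made up for illustration: $bytes is assumed to hold UTF-8 encoded octets
and $text decoded characters.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical inputs: octets as they might arrive from a socket, and text
# that has already been decoded into characters.
my $bytes = "\xC3\xA9";                        # 2 octets: UTF-8 for U+00E9
my $text  = decode('UTF-8', "\xE2\x82\xAC");   # 1 character: U+20AC

# Perl cannot know that $bytes was meant as octets. It treats each octet
# as a latin1 character and upgrades it, without any warning.
my $mixed = $bytes . $text;

printf "characters: %d\n", length($mixed);     # 3: U+00C3, U+00A9, U+20AC

# Encoding the result shows the original two octets were double-encoded.
my $out = encode('UTF-8', $mixed);
print join(' ', map { sprintf '%02X', ord($_) } split //, $out), "\n";
# prints: C3 83 C2 A9 E2 82 AC
```

Decoding $bytes before the concatenation, or encoding $text before it,
avoids the problem; which of the two is right depends on what the
program meant, and that is exactly what Perl cannot see.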

> 	use warnings 'upgrade';

This already exists on CPAN, authored by Audrey Tang, as
encoding::warnings:

    use encoding::warnings;

But it warns whenever Perl upgrades a latin1 string to UTF-8, whether
that is a bug or a feature, because it cannot know whether the "latin1"
string was meant as a text string or a byte string.

It's a useful debugging tool for finding unintended upgrades, but you
shouldn't try to avoid upgrading altogether. That just hurts, because
upgrading is part of how the Perl Unicode model was intended to work.

> 	* the length in bytes
> 	* the length in characters (not always set, e.g. can be unknown)
> 	* the storage buffer (containing the data, plus some optional padding)
> 	* the encoding

Hey, cool, Perl has almost the same thing, only it supports just two
encodings: latin1 and UTF-8. It uses a single bit, the UTF8 flag, to
indicate the encoding. When the flag is off, the string is stored as
latin1; when it's on, the string is stored as UTF-8.
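
A small sketch of the flag at work, assuming nothing beyond the built-in
utf8:: functions and the core bytes module. The sample string is
arbitrary; bytes::length is used only to peek at the internal buffer.

```perl
use strict;
use warnings;
use bytes ();    # gives access to bytes::length without byte semantics

my $s    = "caf\xE9";    # 4 characters, stored as latin1 (flag off)
my $copy = $s;
utf8::upgrade($copy);    # same characters, now stored as UTF-8 (flag on)

my $flag_s    = utf8::is_utf8($s)    ? "on" : "off";
my $flag_copy = utf8::is_utf8($copy) ? "on" : "off";

print "flags:    $flag_s $flag_copy\n";                        # off on
print "equal:    ", ($s eq $copy ? "yes" : "no"), "\n";        # yes
print "chars:    ", length($s), " ", length($copy), "\n";      # 4 4
print "internal: ", bytes::length($s), " ",
                    bytes::length($copy), "\n";                # 4 5
```

The two strings compare equal and have the same length in characters;
only the internal storage differs.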

Maybe you should try Perl; you'll like the way it's built, because it
very closely matches your own design!

The same type of string can be used for binary data, because latin1
maps each of its 256 code points (U+0000 through U+00FF) to the byte
with the same value.
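
As a rough illustration, assuming only the built-in utf8:: functions and
the core bytes module: a string holding every byte value survives an
upgrade and a downgrade unchanged. The variable names are made up for
the example.

```perl
use strict;
use warnings;
use bytes ();    # gives access to bytes::length without byte semantics

my $binary = pack 'C*', 0 .. 255;       # every byte value once, flag off

my $copy = $binary;
utf8::upgrade($copy);                   # internal buffer is now UTF-8
my $same_chars = $copy eq $binary;      # still the same 256 characters
my $internal   = bytes::length($copy);  # 128 one-byte + 128 two-byte sequences

utf8::downgrade($copy);                 # succeeds: nothing is above 0xFF
my $restored = join(' ', unpack 'C*', $copy) eq join(' ', 0 .. 255);

print "same characters: ", ($same_chars ? "yes" : "no"), "\n";   # yes
print "internal length: $internal\n";                            # 384
print "bytes restored:  ", ($restored ? "yes" : "no"), "\n";     # yes
```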

> In short, it becomes a mess.

Yes, with strong typing, especially with string subtypes for arbitrary
encodings, it would be cleaner. But it would also not look like Perl 5.
-- 
kind regards,

  juerd waalboer:  perl hacker  <juerd@juerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <sales@convolution.nl>

I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.
