develooper Front page | perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

From: Tels
Date: March 30, 2007 15:20
Subject: Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)
Message ID: 200703310019.16984@bloodgate.com

Moin,

On Friday 30 March 2007 21:44:12 Juerd Waalboer wrote:
> Tels skribis 2007-03-30 23:17 (+0000):
> > > If it is so deadly to collide byte-oriented data with character data,
> > > it should not be so easy to do so accidentally.
> >
> > It can happen every time you concatenate two strings. Maybe we could add
> > a new warning?
>
> Eh, no, because Perl does not have any metadata telling you if this
> non-UTF8 string is a latin1 text string, or just a random byte string.
>
> There is no way to tell Perl how you intended your string to be used,
> and there is no way for Perl to tell you the same thing about a string
> it returned.
>
> > 	use warnings 'upgrade';
>
> This already exists on CPAN, authored by Audrey Tang, as
> encoding::warnings:
>
>     use encoding::warnings;
>
> But it will warn when Perl upgrades latin1 to utf-8, without knowing if
> that is a bug or a feature, because it doesn't know if the "latin1"
> string was meant as a text string or a byte string.
>
> It's a useful debugging tool, to find unintended upgrades, but you
> shouldn't try to avoid upgrading altogether. That just hurts, because
> upgrading is part of the way the Perl Unicode model was intended.
>
> > 	* the length in bytes
> > 	* the length in characters (not always set, e.g. can be unknown)
> > 	* the storage buffer (containing the data, plus some optional padding)
> > 	* the encoding
>
> Hey, cool, Perl has almost the same thing, only it supports just two
> encodings: latin1 and utf8. It uses a single bit to indicate the
> encoding, the UTF8 flag, which can be on or off. When it's off, the
> string is latin1, when it's on, the string is UTF-8.
>
> Maybe you should try Perl; you'll like the way it's built, because it
> very closely matches your own design!

First for the record:

The application I am outfitting is written in C, for speed, and quite large.
So there is NO way I would even consider rewriting it in Perl. I'm just
using the right tool for the job. That doesn't mean I do not like
Perl, or the way Perl does things. Sorry if it sounded that way.

Anyway, I wasn't aware that any non-utf8 data in Perl is *always*
ISO-8859-1. I thought that, when not specified, this depended on something
else. Guess I need to reread the tutorials. :)

However, this also poses the question: How does Perl know that your data is 
in KOI8-R?

(Yes, that's a trick question, but I would like to hear your answer to that, 
in any case, just to make it clear to me. No offence meant!)
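
For reference, the usual answer is that Perl cannot know: a byte string
carries no encoding label, so the program has to name the encoding when it
decodes. A minimal sketch, assuming the core Encode module and a made-up
one-byte input chosen only for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: the single byte 0xF0, which is CYRILLIC CAPITAL
# LETTER PE in KOI8-R. Nothing in the string itself says so.
my $bytes = "\xF0";

# Left undecoded, Perl treats the byte as codepoint U+00F0 (latin1).
my $as_latin1 = ord($bytes);

# The program has to state the encoding explicitly.
my $text    = decode('KOI8-R', $bytes);
my $as_koi8 = ord($text);

printf "undecoded: U+%04X, decoded as KOI8-R: U+%04X\n",
    $as_latin1, $as_koi8;
```

This should print U+00F0 for the undecoded byte and U+041F for the decoded
character.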

One of the limitations of Perl's "there can be only two encodings" model
seems to be that strings are permanently upgraded:

	$iso_8859_1 = '...';
	$utf8 = '...';

	if ($iso_8859_1 eq $utf8) { ... }

Please correct me if I am wrong, but I do not think it is possible to
keep both variables in their current encoding and only temporarily upgrade
them to utf8 (the common encoding that can represent both of them).
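
A quick check suggests that the comparison itself leaves both operands as
they were. A minimal sketch, using the built-in utf8::upgrade and
utf8::is_utf8 functions; the string contents are invented for illustration:

```perl
use strict;
use warnings;

my $iso_8859_1 = "caf\xE9";      # stored as bytes, UTF8 flag off
my $utf8       = "caf\xE9";
utf8::upgrade($utf8);            # same characters, stored as UTF-8, flag on

my $equal = ($iso_8859_1 eq $utf8) ? 1 : 0;

# Inspect the flags after the comparison has run.
my $flag_a = utf8::is_utf8($iso_8859_1) ? 1 : 0;
my $flag_b = utf8::is_utf8($utf8)       ? 1 : 0;

print "equal=$equal flag_a=$flag_a flag_b=$flag_b\n";
```

This should print "equal=1 flag_a=0 flag_b=1": the strings compare equal as
characters, and each keeps its own internal representation.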

After reading this discussion, a lot of problems also seem to stem from
the fact that the upgrade to utf8 is permanent, silent, and done
"behind the scenes". Just like 1 + 2.0 will result in 3.0 and not 3,
and we all know how much confusion this creates :) (heh, I fell for it
today, even though I should have known better :)
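
The silent, permanent case is easiest to see with an appending
concatenation. A minimal sketch, assuming nothing beyond the built-in
utf8:: functions; the variable names are made up:

```perl
use strict;
use warnings;

my $buf  = "\xE9";          # one byte, UTF8 flag off
my $text = "\x{263A}";      # a character above 255, stored with the flag on

my $flag_before = utf8::is_utf8($buf) ? 1 : 0;
$buf .= $text;              # no warning is issued; $buf is upgraded in place
my $flag_after  = utf8::is_utf8($buf) ? 1 : 0;

# Character length versus the size of the UTF-8 encoded form.
my $chars = length $buf;
my $copy  = $buf;
utf8::encode($copy);        # turn the copy into its UTF-8 bytes
my $bytes = length $copy;

print "flag: $flag_before -> $flag_after, chars=$chars, bytes=$bytes\n";
```

This should print "flag: 0 -> 1, chars=2, bytes=5": the original byte 0xE9
now occupies two bytes internally, although the string still holds the same
characters.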

> The same type of string can be used for binary data, because in the
> unicode encoding "latin1", all 256 codepoints map to the same byte
> values.

This sounds like a circular definition, because in CP1250, too, all 256
codepoints map to the same byte values. Except that they are different byte
values :)
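
The difference can be made concrete: for latin1 the byte value and the
Unicode codepoint are the same number, which is not the case for CP1250.
A minimal sketch, assuming the core Encode module; the byte 0x8A is picked
purely for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $byte = "\x8A";

# ISO-8859-1: byte N becomes Unicode codepoint N.
my $latin1 = ord(decode('ISO-8859-1', $byte));

# CP1250: the same byte goes through a lookup table and comes out as
# LATIN CAPITAL LETTER S WITH CARON.
my $cp1250 = ord(decode('cp1250', $byte));

printf "ISO-8859-1: U+%04X, CP1250: U+%04X\n", $latin1, $cp1250;
```

This should print U+008A for ISO-8859-1 and U+0160 for CP1250.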

In my application, I also considered having a "BINARY" encoding, but in the
end I opted to make ISO-8859-1 the default encoding for BINARY stuff. (Ha,
great minds sink alike or so) And since, unlike in Perl, upgrades are
never done permanently, you can keep your BINARY string and compare it to
any UTF-8 string, and it never gets "corrupted".

I am not sure how one could achieve that in Perl. Making the SV read-only?
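
One approach that does not need a read-only SV, offered as a sketch rather
than as something proposed in this thread: encode the text to bytes
explicitly before it meets the binary data, so that no implicit upgrade is
ever needed. It assumes the core Encode module; the variable names and byte
values are made up:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $binary = "\xFF\xFE\x00\x01";     # arbitrary bytes, UTF8 flag off
my $text   = "\x{263A}";             # a character string

# Encode first, then concatenate: both operands are plain byte strings.
my $packet = $binary . encode('UTF-8', $text);

my $flag  = utf8::is_utf8($packet) ? 1 : 0;
my $bytes = length $packet;          # 4 binary bytes + 3 UTF-8 bytes

print "flag=$flag length=$bytes\n";
```

This should print "flag=0 length=7": the result is still a byte string, and
the binary prefix is stored exactly as it was.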

> > In short, it becomes a mess.
>
> Yes, with strong typing, especially with string subtypes for arbitrary
> encodings, it would be cleaner. But it would also not look like Perl 5.

Over the years, I have come to the insight that I want to build reliable
and fast programs. (easy to maintain, reliable, fast, pick two :-)

So maybe we really need "use strict 'encodings';" :-)

All the best,

Tels

