perl.perl5.porters | Postings from February 2008

Re: use encoding 'utf8' bug for Latin-1 range

From: Tels
Date: February 29, 2008 05:29
Subject: Re: use encoding 'utf8' bug for Latin-1 range
Message ID: 200802291429.16065@bloodgate.com

On Thursday 28 February 2008 22:12:03 Nicholas Clark wrote:
> On Thu, Feb 28, 2008 at 08:34:12PM +0100, Tels wrote:
> > Moin,
> > On Thursday 28 February 2008 20:22:45 Nicholas Clark wrote:
> > > On Wed, Feb 27, 2008 at 11:57:53AM +0100, Juerd Waalboer wrote:
> > > > demerphq skribis 2008-02-27 11:45 (+0100):
> > [snip]
> > > > A heuristic can never be used to determine validity.
> > > >
> > > > I'm suggesting that Perl should assume UTF-8 in the absence of
> > > > any BOM, but fall back to latin1 decoding if the source turns
> > > > out to be invalid UTF-8. This can be done on a per
> > > > bytesequence, per line, or per file basis and a warning should
> > > > > probably be emitted if some but not all of the file is valid
> > > > UTF-8.
> > >
> > > I can't see that "per line" is the way to go. A file is either
> > > UTF-8, or ISO-8859-1*, or it's broken and should be fatally
> > > rejected
> > >
> > > (OK, or it's ISO-8859-15 or ISO-8859-\d or Windows 1252, or
> > > Windows *, or all the other character sets with 256 or fewer code
> > > points, where code points 0-127 are identical to ASCII)
> >
> > Yeah.
> >
> > > > It could even be done on a per rest-of-the-file basis: read
> > > > everything as UTF-8, keeping track of whether a non-ASCII UTF-8
> > > > sequence has been encountered. Upon seeing an invalid UTF-8
> > > > byte sequence,
> > > > if ($utf8_sequence_seen) { die } else { switch back to latin1 }
> > >
> > > I'm not convinced that I like heuristics. Klortho says:
> > >
> > >     #11953 Of course, this is a heuristic, which is a fancy way
> > > of saying that it doesn't work.
> > >
> > >
> > > However, this one seems workable. Default is heuristic mode, and
> > > heuristic mode is:
> >
> > So a file with:
> >
> > 30 c3 b6	0ö
> >
> > will be determined to be UTF-8, even tho it is valid ISO-8859-1?
>
> Yes. But all UTF-8 is valid ISO-8859-1, so this follows.

Yes, but it is not possible for a computer to decide *which* encoding 
this actually is. It could only guess.

And I am not buying the argument of "oooh it's very rare that this will 
be valid XYZ". It is valid in both, and therefore cannot be decided, 
and therefore adds an element of surprise and uncertainty into the 
code. (Like, add a byte to your source, and suddenly it is interpreted 
in a different encoding.)
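To illustrate the "add a byte" point, here is a minimal sketch of a whole-input version of the proposed heuristic, using the bytes 30 c3 b6 from the example above. The helper name guess_decode is hypothetical, and the Encode module stands in for whatever perl itself would do when reading source; nothing like this exists in the core.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: guess the encoding of a byte string the way the
# proposed heuristic would, applied to the input as a whole.
sub guess_decode {
    my ($bytes) = @_;

    # Strict UTF-8 decode. FB_CROAK dies on malformed input and
    # LEAVE_SRC keeps $bytes unmodified.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return ('UTF-8', $text) if defined $text;

    # Not well-formed UTF-8: fall back to Latin-1, which accepts any byte.
    return ('ISO-8859-1', decode('ISO-8859-1', $bytes));
}

my $original = "\x30\xc3\xb6";       # well-formed in both encodings
my $edited   = $original . "\xe9";   # one extra Latin-1 byte (e-acute)

for my $bytes ($original, $edited) {
    my ($encoding, $text) = guess_decode($bytes);
    printf "%-10s %d characters\n", $encoding, length $text;
}
```

The first input is reported as UTF-8 with 2 characters, the second as ISO-8859-1 with 4: appending a single byte changes how the three bytes before it are interpreted.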

The upshot is that if this heuristic is added, the first piece of 
best-practice advice will be:

* always add "use utf8;" to your source. Don't skip it (it adds 
unpredictability) and don't "use encoding;" (it doesn't work properly)

Which would somehow defeat the point of adding the heuristic (albeit I 
would like to give the "use utf8;" advice right now, anyway).

> What Juerd has been saying on the list and on IRC is that it is
> vanishingly rare for real data in ISO-8859-1 to contain sequences
> that are actually valid as UTF-8.

And he determined this how? :)
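One way to probe the claim on actual data would be to check Latin-1 byte strings for UTF-8 well-formedness. The sketch below does that for a few hand-picked strings. They are illustrations only, not evidence either way, and the helper name is_valid_utf8 is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: true if $bytes is well-formed (strict) UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

# Hand-picked strings, written with \x{..} escapes so that this file
# itself stays pure ASCII: cafe, naive and Zurich with their accents,
# plus A-tilde followed by a pilcrow sign.
my @samples = ("caf\x{e9}", "na\x{ef}ve", "Z\x{fc}rich", "\x{c3}\x{b6}");

for my $text (@samples) {
    my $latin1 = encode('ISO-8859-1', $text);
    printf "%-18s %s\n",
        join(' ', unpack('(H2)*', $latin1)),
        is_valid_utf8($latin1) ? 'also valid UTF-8' : 'not valid UTF-8';
}
```

An accented Latin-1 letter followed by an ASCII character is never well-formed UTF-8, because a lead byte must be followed by continuation bytes in the range 0x80-0xBF. Only the last sample passes. How often such pairs occur in real Latin-1 text is the open question.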

> > > If you want to say that your source code is UTF-8, you
> > >
> > >    use ker_sploosh 'utf-8';
> >
> > You mean
> >
> > use utf8;
> >
> > ? :)
>
> No, I don't. Because of this:

Ah ok. And if I got you right, we can't make it die?

> > > $ perl -lwe 'use utf8; $a = "£"; warn length $a; die "But we ran"'
> > > Malformed UTF-8 character (unexpected continuation byte 0xa3, with no preceding start byte) at -e line 1.
> > > 1 at -e line 1.
> > > But we ran at -e line 1.
> >
> > 	# perl -lwe 'use utf8; $a = "£"; warn length $a; die "But we ran"'
> > 	1 at -e line 1.
> > 	But we ran at -e line 1.
> >
> > I guess you used a different encoding, and the paste into the email
> > made it into UTF-8 somehow?
>
> Everything at my end is ISO-8859-1. I assumed that my e-mail went out
> as 8 bit, not UTF-8.

It does:

	Content-Type: text/plain;
	  charset=iso-8859-1

but it seems my email composer automatically converts it to my native 
UTF-8 upon editing it :) (Sometimes, smart software defeats me :D)

> > In any event, I don't see why "use utf8" shouldn't die when the
> > source contains non-utf-8. After all, you just told Perl it does ;)
>
> I would have liked it if it did. But it already seems that we have it
> the wrong way, and I'd prefer to deprecate the wrongness, than change
> it again.

I am not in a position to argue one way or the other :)
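For comparison, a strict check that does die can be done outside the pragma. This is only a sketch of an external pre-flight check on source bytes, not a description of how "use utf8" behaves or would be changed. The name assert_utf8_source is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical pre-flight check: die unless the given source bytes are
# well-formed UTF-8.
sub assert_utf8_source {
    my ($bytes, $name) = @_;
    eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    } or die "$name: source is not valid UTF-8\n";
    return 1;
}

# The one-liner from the thread, once with the pound sign as a lone
# Latin-1 byte (0xa3) and once as the UTF-8 sequence c2 a3.
my $latin1_source = qq{use utf8; \$a = "\xa3"; warn length \$a;};
my $utf8_source   = qq{use utf8; \$a = "\xc2\xa3"; warn length \$a;};

for my $case ([latin1 => $latin1_source], [utf8 => $utf8_source]) {
    my ($name, $bytes) = @$case;
    my $ok = eval { assert_utf8_source($bytes, $name) };
    print $ok ? "$name: accepted\n" : "rejected: $@";
}
```

The Latin-1 variant is rejected and the UTF-8 variant is accepted, which is the behaviour one might expect after telling Perl the source is UTF-8.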

All the best,

Tels

-- 
 Signed on Fri Feb 29 14:22:05 2008 with key 0x93B84C15.
 Get one of my photo posters: http://bloodgate.com/posters
 PGP key on http://bloodgate.com/tels.asc or per email.

 "I want to squirt you a picture of my kids. You want to squirt me back
 a video of your vacation. That's a software experience."

  -- Steve Baller on the Zune
