perl.perl5.porters | Postings from February 2008

Re: use encoding 'utf8' bug for Latin-1 range

From: Nicholas Clark
Date: February 28, 2008 13:12
Subject: Re: use encoding 'utf8' bug for Latin-1 range
Message ID: 20080228211202.GO87113@plum.flirble.org

On Thu, Feb 28, 2008 at 08:34:12PM +0100, Tels wrote:
> Moin,
> 
> On Thursday 28 February 2008 20:22:45 Nicholas Clark wrote:
> > On Wed, Feb 27, 2008 at 11:57:53AM +0100, Juerd Waalboer wrote:
> > > demerphq skribis 2008-02-27 11:45 (+0100):
> [snip]
> > > A heuristic can never be used to determine validity.
> > >
> > > I'm suggesting that Perl should assume UTF-8 in the absence of any
> > > BOM, but fall back to latin1 decoding if the source turns out to be
> > > invalid UTF-8. This can be done on a per bytesequence, per line, or
> > > per file basis and a warning should probably be emitted if some but
> > > not all of the file is valid UTF-8.
> >
> > I can't see that "per line" is the way to go. A file is either UTF-8,
> > or ISO-8859-1*, or it's broken and should be fatally rejected
> >
> > (OK, or it's ISO-8859-15 or ISO-8859-\d or Windows 1252, or Windows
> > *, or all the other character sets with 256 or fewer code points,
> > where code points 0-127 are identical to ASCII)
> 
> Yeah.
> 
> > > It could even be done on a per rest-of-the-file basis: read
> > > everything as UTF-8, keeping track of whether a non-ASCII UTF-8
> > > sequence has been encountered. Upon seeing an invalid UTF-8 byte
> > > sequence,
> > > if ($utf8_sequence_seen) { die } else { switch back to latin1 }
> >
> > I'm not convinced that I like heuristics. Klortho says:
> >
> >     #11953 Of course, this is a heuristic, which is a fancy way of
> >     saying that it doesn't work.
> >
> >
> > However, this one seems workable. Default is heuristic mode, and
> > heuristic mode is:
> 
> So a file with:
> 
> 30 c3 b6	3÷
> 
> will be determined to be UTF-8, even tho it is valid ISO-8859-1?

Yes. But since all UTF-8 is also valid ISO-8859-1, that follows.

(octets 128-159 are defined as control characters:
http://en.wikipedia.org/wiki/ISO_8859-1#Codepage_layout
(hmm, "ISO-8859-1" isn't "ISO 8859-1". How fun. And it seems that we can't say
we're doing "ISO 8859-1" as then we don't have newline, or any other control
character in the octet range 0-31))

What Juerd has been saying on the list and on IRC is that it is vanishingly
rare for real data in ISO-8859-1 to contain sequences that are actually
valid as UTF-8.
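
To make the heuristic under discussion concrete, here is a minimal sketch of the whole-file variant: try a strict UTF-8 decode, and fall back to ISO-8859-1 only if that fails. The helper name decode_guess is hypothetical and this is not how perl itself reads source code; it only uses the Encode module's decode() with the FB_CROAK check to illustrate the idea.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of perl or Encode): decode a byte string
# as strict UTF-8, falling back to ISO-8859-1 if it is not well-formed.
sub decode_guess {
    my ($bytes) = @_;

    # FB_CROAK makes decode() die on malformed input. Work on a copy,
    # because decode() may modify its source argument when a check is set.
    my $copy = $bytes;
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return ('UTF-8', $text) if defined $text;

    # Every octet sequence is decodable as ISO-8859-1, so this cannot fail.
    return ('ISO-8859-1', decode('ISO-8859-1', $bytes));
}

# The bytes from Tels's example, and a lone Latin-1 pound sign.
my ($enc1, $text1) = decode_guess("\x30\xc3\xb6");
my ($enc2, $text2) = decode_guess("\xa3");

printf "%s, %d character(s)\n", $enc1, length $text1;
printf "%s, %d character(s)\n", $enc2, length $text2;
```

With this sketch the three bytes 30 c3 b6 are taken as UTF-8 (two characters), which is exactly the case Tels raises, while the lone a3 falls back to ISO-8859-1.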

> I don't think this heuristic is ever gonna work, unless you 
> defined "work" as "works sometimes, if you are lucky".

I think it's more likely than "sometimes, if you are lucky".

> And I don't like this kind of "fuzzy computing".

No, me neither.

> > If you want to say that your source code is UTF-8, you
> >
> >    use ker_sploosh 'utf-8';
> 
> You mean
> 
> use utf8;
> 
> ? :)

No, I don't. Because of this:

> > $ perl -lwe 'use utf8; $a = "£"; warn length $a; die "But we ran"'
> > Malformed UTF-8 character (unexpected continuation byte 0xa3, with no
> > preceding start byte) at -e line 1.
> > 1 at -e line 1.
> > But we ran at -e line 1.
> 
> 	# perl -lwe 'use utf8; $a = "£"; warn length $a; die "But we ran"'
> 	1 at -e line 1.
> 	But we ran at -e line 1.
> 
> I guess you used a different encoding, and the paste into the email made 
> it into UTF-8 somehow?

Everything at my end is ISO-8859-1. I assumed that my e-mail went out as
8 bit, not UTF-8.
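
The difference between the two runs above can be illustrated by looking at the bytes for the pound sign in each encoding. The following is a small sketch using the Encode module; the helper names hex_bytes and is_continuation_byte are hypothetical, introduced only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split(//, $bytes);
}

# Hypothetical helper: true if the octet has the bit pattern 10xxxxxx,
# which in UTF-8 is only legal after a start byte.
sub is_continuation_byte {
    my ($octet) = @_;
    return (($octet & 0xC0) == 0x80) ? 1 : 0;
}

my $pound  = "\x{a3}";                        # U+00A3 POUND SIGN
my $latin1 = encode('ISO-8859-1', $pound);    # one octet
my $utf8   = encode('UTF-8', $pound);         # two octets

print "ISO-8859-1: ", hex_bytes($latin1), "\n";
print "UTF-8:      ", hex_bytes($utf8), "\n";
print "0xa3 has the form of a continuation byte\n"
    if is_continuation_byte(0xa3);
```

In ISO-8859-1 the pound sign is the single octet a3, which on its own looks like a UTF-8 continuation byte with no start byte before it. In UTF-8 the same character is c2 a3, which is well-formed and would produce no warning.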


> In any event, I don't see why "use utf8" shouldn't die when the source 
> contains non-utf-8. After all, you just told Perl it does ;)

I would have liked it if it did. But it seems that we already have it the
wrong way, and I'd rather deprecate the wrongness than change it again.

Nicholas Clark
