perl.perl5.porters | Postings from February 2008

Re: use encoding 'utf8' bug for Latin-1 range

February 28, 2008 11:34

On Thursday 28 February 2008 20:22:45 Nicholas Clark wrote:
> On Wed, Feb 27, 2008 at 11:57:53AM +0100, Juerd Waalboer wrote:
> > demerphq skribis 2008-02-27 11:45 (+0100):
> > A heuristic can never be used to determine validity.
> >
> > I'm suggesting that Perl should assume UTF-8 in the absence of any
> > BOM, but fall back to latin1 decoding if the source turns out to be
> > invalid UTF-8. This can be done on a per bytesequence, per line, or
> > per file basis and a warning should probably be emitted if some but
> > not all of the file is valid UTF-8.
> I can't see that "per line" is the way to go. A file is either UTF-8,
> or ISO-8859-1*, or it's broken and should be fatally rejected
> (OK, or it's ISO-8859-15 or ISO-8859-\d or Windows 1252, or Windows
> *, or all the other character sets with 256 or fewer code points,
> where code points 0-127 are identical to ASCII)


> > It could even be done on a per rest-of-the-file basis: read
> > everything as UTF-8, keeping track of whether a non-ASCII UTF-8
> > sequence has been encountered. Upon seeing an invalid UTF-8 byte
> > sequence,
> > if ($utf8_sequence_seen) { die } else { switch back to latin1 }
> I'm not convinced that I like heuristics. Klortho says:
>     #11953 Of course, this is a heuristic, which is a fancy way of
> saying that it doesn't work.
> However, this one seems workable. Default is heuristic mode, and
> heuristic mode is:

So a file with:

30 c3 b6	3÷

will be determined to be UTF-8, even tho it is valid ISO-8859-1?

I don't think this heuristic is ever gonna work, unless you
define "work" as "works sometimes, if you are lucky".

And I don't like this kind of "fuzzy computing".
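
To make the objection concrete, here is a minimal sketch of the simplest per-file form of the quoted heuristic (try strict UTF-8, otherwise fall back to ISO-8859-1), applied to the three bytes above exactly as written. The helper name guess_decode is made up for illustration and is not anything in Perl itself; it leans on the Encode module's strict 'UTF-8' decoder.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not part of Perl: try strict UTF-8 first and
# fall back to ISO-8859-1 only if the bytes are not well-formed UTF-8.
sub guess_decode {
    my ($bytes) = @_;
    my $text = eval {
        Encode::decode('UTF-8', $bytes,
                       Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return ('UTF-8', $text) if defined $text;
    return ('ISO-8859-1', Encode::decode('ISO-8859-1', $bytes));
}

my $bytes = "\x30\xc3\xb6";    # the three bytes from the example

my ($guess, $text) = guess_decode($bytes);
my $latin1 = Encode::decode('ISO-8859-1', $bytes);

# The same bytes are valid either way, but mean different things.
printf "heuristic picked %s: %d characters\n", $guess, length $text;
printf "read as ISO-8859-1: %d characters\n", length $latin1;
```

The bytes are well-formed in both encodings (two characters as UTF-8, three as ISO-8859-1), so the heuristic picks UTF-8 and has nothing to go on if the author meant the other.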

> If you want to say that your source code is UTF-8, you
>    use ker_sploosh 'utf-8';

You mean

use utf8;

? :)

> (where we need a better name than ker_sploosh) and heuristics are
> off. (And so the first invalid UTF-8 sequence is a fatal error)
> If you want to say that your source code is ISO-8859-1 but happens to
> have some literal sequences that would also be valid if interpreted
> as UTF-8, you say
>    use ker_sploosh 'iso-8859-1';
> If you want iso-8859-15, or Windows 1252 (or Shift-JIS, or strict
> ASCII) you say so.
> And anything that is invalid in your stated (or heuristically
> assumed) encoding is a fatal compile time error. Which is not the
> case with C<use utf8;>
> $ perl -lwe 'use utf8; $a = "£"; warn length $a; die "But we ran"'
> Malformed UTF-8 character (unexpected continuation byte 0xa3, with no
> preceding start byte) at -e line 1.
> 1 at -e line 1.
> But we ran at -e line 1.

	# perl -lwe 'use utf8; $a = "£"; warn length $a; die "But we ran"'
	1 at -e line 1.
	But we ran at -e line 1.

I guess you used a different encoding, and the paste into the email made 
it into UTF-8 somehow?

In any event, I don't see why "use utf8" shouldn't die when the source
contains invalid UTF-8. After all, you just told Perl the source is UTF-8 ;)
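
As a rough sketch of what "die on non-UTF-8" could look like, and of why the two runs above can differ, the following checks both byte forms of the pound sign with a strict decoder. The helper is_valid_utf8 is a made-up name; this uses the Encode module and says nothing about how the utf8 pragma itself is implemented.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if $bytes is well-formed UTF-8, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        Encode::decode('UTF-8', $bytes,
                       Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

my $pound_utf8   = "\xc2\xa3";  # U+00A3 encoded as UTF-8 (two bytes)
my $pound_latin1 = "\xa3";      # U+00A3 as a single ISO-8859-1 byte

my $utf8_ok   = is_valid_utf8($pound_utf8);
my $latin1_ok = is_valid_utf8($pound_latin1);

print "two-byte form: ", ($utf8_ok   ? "accepted" : "rejected"), "\n";
print "one-byte form: ", ($latin1_ok ? "accepted" : "rejected"), "\n";
```

A lone 0xa3 byte is a continuation byte with no start byte, which matches the warning in the quoted run; the two-byte form decodes cleanly to one character, which matches mine.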

All the best,


 Signed on Thu Feb 28 20:29:15 2008 with key 0x93B84C15.

 Miko: "Detect Evil!"
Belkar, holding up check-warding sheet of lead:
 "Too slow, sister."

  -- The Order of The Stick
