perl.perl5.porters | Postings from February 2008

Re: use encoding 'utf8' bug for Latin-1 range

Nicholas Clark
February 28, 2008 11:22
On Wed, Feb 27, 2008 at 11:57:53AM +0100, Juerd Waalboer wrote:
> demerphq skribis 2008-02-27 11:45 (+0100):
> > >  > * Deprecate non-ASCII characters in Perl 5.12 source code unless a
> > >  > source encoding is specified.  Make UTF-8, rather than ASCII, the
> > >  > default source encoding for Perl 5.14.
> > > I wouldn't object, but would prefer to see Perl 5.12 already interpret
> > >  source code as UTF-8 if it happens to indeed be valid UTF-8. A silent or
> > >  warning fallback to latin1 could be used for backwards compatibility.
> > It does, except it does not use a heuristic to determine if it's valid
> > utf8 (which is the only way to tell), it looks for BOM markers.
> A heuristic can never be used to determine validity.
> I'm suggesting that Perl should assume UTF-8 in the absence of any BOM,
> but fall back to latin1 decoding if the source turns out to be invalid
> UTF-8. This can be done on a per byte sequence, per line, or per file
> basis and a warning should probably be emitted if some but not all of the
> file is valid UTF-8.

I can't see that "per line" is the way to go. A file is either UTF-8,
or ISO-8859-1*, or it's broken and should be fatally rejected.

(OK, or it's ISO-8859-15 or ISO-8859-\d or Windows 1252, or Windows *,
or all the other character sets with 256 or fewer code points, where code
points 0-127 are identical to ASCII)

> It could even be done on a per rest-of-the-file basis: read everything
> as UTF-8, keeping track of whether a non-ASCII UTF-8 sequence has been
> encountered. Upon seeing an invalid UTF-8 byte sequence,
> if ($utf8_sequence_seen) { die } else { switch back to latin1 }

I'm not convinced that I like heuristics. Klortho says:

    #11953 Of course, this is a heuristic, which is a fancy way of saying
    that it doesn't work.

However, this one seems workable. Default is heuristic mode, and heuristic
mode is:

Default state at the start of the file is that it's undecided.

Whilst the file continues to be clean 7 bit ASCII, nothing changes.

If the first sequence of octets >127 is valid UTF-8, then the file is assumed
(from here on in) to be UTF-8, and any invalid UTF-8 is a fatal compile error.

If the first sequence of octets >127 is not valid UTF-8, then the file is
assumed (from here on in) to be ISO-8859-1, and if any subsequent sequence of
octets >127 is either valid UTF-8 or contains octets in the range 128-159,
then it is a fatal compile time error.

BOM markers are valid UTF-8 sequences, so the rules above will automatically
make anything starting with a BOM into UTF-8.
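
The rules above can be sketched as a small state machine. This is only an
illustration, not how perl's tokenizer works: the names is_valid_utf8 and
guess_source_encoding are made up for the example, it uses Encode's strict
'UTF-8' decoder as the validity test, and it reads "subsequent" literally, so
the 128-159 check is not applied to the first run of high octets.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $octets is strictly valid UTF-8.
# $octets is a private copy, because decode() with a CHECK value
# may modify its argument.
sub is_valid_utf8 {
    my ($octets) = @_;
    return eval { Encode::decode('UTF-8', $octets, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Hypothetical helper: returns 'undecided', 'utf-8' or 'iso-8859-1',
# and dies where the rules call for a fatal compile time error.
sub guess_source_encoding {
    my ($source) = @_;          # the source file as a string of octets
    my $state = 'undecided';

    # Visit each run of octets >127; 7 bit ASCII never changes the state.
    while ($source =~ /([\x80-\xFF]+)/g) {
        my $run   = $1;
        my $valid = is_valid_utf8($run);

        if ($state eq 'undecided') {
            # The first run of high octets decides the encoding.
            $state = $valid ? 'utf-8' : 'iso-8859-1';
        }
        elsif ($state eq 'utf-8') {
            die "Invalid UTF-8 in source assumed to be UTF-8\n"
                unless $valid;
        }
        else {
            die "Valid UTF-8 in source assumed to be ISO-8859-1\n"
                if $valid;
            die "Octet in range 128-159 in source assumed to be ISO-8859-1\n"
                if $run =~ /[\x80-\x9F]/;
        }
    }
    return $state;
}
```

A caller would read the file in binary mode, pass the octets in, and treat an
exception as the compile error. A file that is pure ASCII comes back as
'undecided', which is harmless because all the candidate encodings agree on
code points 0-127.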

If you want to say that your source code is UTF-8, you

   use ker_sploosh 'utf-8';

(where we need a better name than ker_sploosh) and heuristics are off.
(And so the first invalid UTF-8 sequence is a fatal error)

If you want to say that your source code is ISO-8859-1 but happens to have
some literal sequences that would also be valid if interpreted as UTF-8, you

   use ker_sploosh 'iso-8859-1';

If you want iso-8859-15, or Windows 1252 (or Shift-JIS, or strict ASCII) you
say so.

And anything that is invalid in your stated (or heuristically assumed) encoding
is a fatal compile time error, which is not the case with C<use utf8;>:

$ perl -lwe 'use utf8; $a = "£"; warn length $a; die "But we ran"'
Malformed UTF-8 character (unexpected continuation byte 0xa3, with no preceding start byte) at -e line 1.
1 at -e line 1.
But we ran at -e line 1.
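
For contrast, here is a rough sketch of the "invalid means fatal" behaviour at
the level of an ordinary string, using Encode's FB_CROAK check. The function
name strict_decode is invented for the example, and this says nothing about
how a source encoding pragma would actually be implemented.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode $octets as $encoding, dying on the first
# sequence that is invalid in that encoding.
sub strict_decode {
    my ($encoding, $octets) = @_;   # $octets is a copy; decode() may modify it
    return Encode::decode($encoding, $octets, Encode::FB_CROAK);
}

my $pound = "\xA3";   # Latin-1 pound sign; a lone continuation byte in UTF-8

# As ISO-8859-1 the octet decodes to a single character.
my $latin1 = strict_decode('iso-8859-1', $pound);

# As UTF-8 the same octet is rejected with an exception, not a warning.
my $utf8_ok = eval { strict_decode('UTF-8', $pound); 1 };
```

Here $utf8_ok ends up false and $@ holds Encode's error message, so the caller
cannot carry on with a half-decoded string the way the one-liner above does.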

Nicholas Clark
