
Re: use encoding 'utf8' bug for Latin-1 range

From: Juerd Waalboer
Date: February 27, 2008 03:01
Subject: Re: use encoding 'utf8' bug for Latin-1 range
Message ID: 20080227105753.GW13615@c4.convolution.nl

demerphq skribis 2008-02-27 11:45 (+0100):
> >  > * Deprecate non-ASCII characters in Perl 5.12 source code unless a
> >  > source encoding is specified.  Make UTF-8, rather than ASCII, the
> >  > default source encoding for Perl 5.14.
> > I wouldn't object, but would prefer to see Perl 5.12 already interpret
> >  source code as UTF-8 if it happens to indeed be valid UTF-8. A silent or
> >  warning fallback to latin1 could be used for backwards compatibility.
> It does, except it does not use a heuristic to determine if its valid
> utf8 (which is the only way to tell), it looks for BOM markers.

A heuristic can never be used to determine validity.

I'm suggesting that Perl should assume UTF-8 in the absence of any BOM,
but fall back to latin1 decoding if the source turns out to be invalid
UTF-8. This can be done on a per-byte-sequence, per-line, or per-file
basis, and a warning should probably be emitted if some but not all of the
file is valid UTF-8.
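
To make the per-line variant concrete, here is a rough sketch. It is written
in Python rather than Perl purely for illustration, and the function name
decode_per_line is made up for this example; it is not how perl reads source.

```python
import warnings

def decode_per_line(raw: bytes) -> str:
    """Decode each line as UTF-8, falling back to latin1 for invalid lines."""
    decoded = []
    utf8_lines = 0    # lines that are valid UTF-8 and contain non-ASCII
    latin1_lines = 0  # lines that were not valid UTF-8
    for line in raw.splitlines(keepends=True):
        try:
            text = line.decode("utf-8")
            if not line.isascii():
                utf8_lines += 1
        except UnicodeDecodeError:
            # Every byte maps to a character in latin1, so this cannot fail.
            text = line.decode("latin-1")
            latin1_lines += 1
        decoded.append(text)
    if utf8_lines and latin1_lines:
        # Some but not all of the file is valid UTF-8.
        warnings.warn("source mixes UTF-8 and latin1 lines")
    return "".join(decoded)
```

Pure-ASCII lines count for neither side, since they read the same in both
encodings.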

It could even be done on a per rest-of-the-file basis: read everything
as UTF-8, keeping track of whether a non-ASCII UTF-8 sequence has been
encountered. Upon seeing an invalid UTF-8 byte sequence:

    if ($utf8_sequence_seen) { die } else { switch back to latin1 }
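
The rest-of-the-file rule could look roughly like this. Again a Python sketch
for illustration only, with a hypothetical function name. It relies on the
fact that the decode error reports where the first invalid byte sits, so
everything before that offset is known to be valid UTF-8.

```python
def decode_rest_of_file(raw: bytes) -> str:
    """Read as UTF-8; on the first invalid sequence either die or use latin1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # raw[:err.start] decoded cleanly. If it holds any non-ASCII byte,
        # a genuine multibyte UTF-8 sequence has already been seen.
        utf8_sequence_seen = not raw[:err.start].isascii()
        if utf8_sequence_seen:
            raise  # the "die" branch: the file mixes encodings
        # Only ASCII so far, which reads the same in both encodings, so
        # decoding the whole file as latin1 equals switching for the rest.
        return raw.decode("latin-1")
```

For example, b"caf\xe9" comes back as latin1 text, while b"caf\xc3\xa9 caf\xe9"
raises because a valid UTF-8 sequence precedes the invalid byte.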

This would support all UTF-8 encoded source, and 99.9999999999% of all
latin1 encoded source. (Estimated, but I think it may be pretty
accurate.)

> Frankly im against using heuristics to determine encoding.

Normally I am too. However, even though I looked for it, I have yet to
encounter any latin1 text that happens to also be interpretable as
valid UTF-8. The likelihood of this ever happening decreases as the
source size increases.
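
A quick way to check this claim against a handful of words, sketched in
Python for illustration (looks_like_utf8 and the sample words are my own
choices, not taken from any real corpus):

```python
def looks_like_utf8(raw: bytes) -> bool:
    """True if raw has non-ASCII bytes and still decodes as strict UTF-8."""
    if raw.isascii():
        return False  # pure ASCII says nothing about the encoding
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Ordinary latin1 words: an accented letter followed by a plain letter or
# by the end of the string is never a well-formed UTF-8 sequence.
samples = ["café", "naïve", "Zürich", "señor"]
results = {w: looks_like_utf8(w.encode("latin-1")) for w in samples}

# A contrived counterexample: the two latin1 characters "Ã©" are the
# bytes C3 A9, which is also the UTF-8 encoding of "é".
contrived = looks_like_utf8("Ã©".encode("latin-1"))
```

To pass as UTF-8, every latin1 byte in the range C0-FF has to be followed by
the right number of bytes in the range 80-BF, which in latin1 are control
characters and symbols. Each accented letter in the text is another chance
to break that pattern.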

It'd be bad to detect *arbitrary* encodings based on heuristics, like
Internet Explorer does. But determining the difference between UTF-8 and
latin1 is actually rather safe.

> Better to just tell people to use editors that ensure the correct BOM
> is prepended to the file.

Not all editors even support this, and it's far from common practice in
the *nix world, whether we like that or not.

Note that looking for a UTF-8 BOM would still be a heuristic. After
all, there's no way to know for sure that those 3 bytes weren't meant as
the three latin1 characters ï»¿ (iuml, raquo, iquest). But it is pretty
far-fetched for any multibyte UTF-8 sequence to be considered as latin1.
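
The ambiguity is easy to show at the byte level. A small Python illustration,
using only the standard codecs module:

```python
import codecs

bom = codecs.BOM_UTF8               # the three bytes EF BB BF

# Read as UTF-8, the bytes are the single character U+FEFF.
as_utf8 = bom.decode("utf-8")

# Read as latin1, the same bytes are three printable characters:
# i-diaeresis, right guillemet, inverted question mark.
as_latin1 = bom.decode("latin-1")
```

Both decodes succeed, so the bytes alone cannot say which reading was meant.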
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <#####@juerd.nl>  <http://juerd.nl/sig>
  Convolution:     ICT solutions and consultancy <sales@convolution.nl>
