perl.perl5.porters | Postings from February 2008

Re: use encoding 'utf8' bug for Latin-1 range

From:
demerphq
Date:
February 27, 2008 03:25
Subject:
Re: use encoding 'utf8' bug for Latin-1 range
Message ID:
9b18b3110802270325u147bfa2dmf2dd2355e3756fb@mail.gmail.com
On 27/02/2008, Juerd Waalboer <juerd@convolution.nl> wrote:
> demerphq skribis 2008-02-27 11:45 (+0100):
>
> > >  > * Deprecate non-ASCII characters in Perl 5.12 source code unless a
>  > >  > source encoding is specified.  Make UTF-8, rather than ASCII, the
>  > >  > default source encoding for Perl 5.14.
>  > > I wouldn't object, but would prefer to see Perl 5.12 already interpret
>  > >  source code as UTF-8 if it happens to indeed be valid UTF-8. A silent or
>  > >  warning fallback to latin1 could be used for backwards compatibility.
>  > It does, except it does not use a heuristic to determine if it's valid
>  > UTF-8 (which is the only way to tell); it looks for BOM markers.
>
>
> A heuristic can never be used to determine validity.

Hence why I'm in favour of BOMs.

>  I'm suggesting that Perl should assume UTF-8 in the absence of any BOM,
>  but fall back to latin1 decoding if the source turns out to be invalid
>  UTF-8. This can be done on a per bytesequence, per line, or per file
>  basis and a warning should probably be emitted if some but not all of the
>  file is valid UTF-8.

This is a heuristic. :-)
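
A minimal sketch of the per-file variant of that fallback, written in Python purely for illustration. It is not how perl reads source today, and the function name decode_source is hypothetical. It assumes the file carries no BOM and that a strict UTF-8 decoder is available:

```python
def decode_source(raw):
    """Return (text, encoding_used) for BOM-less source bytes.

    Hypothetical helper: try strict UTF-8 first, fall back to Latin-1
    for the whole file if any byte sequence is invalid UTF-8.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte value to a character, so this cannot fail.
        return raw.decode("latin-1"), "latin-1"


# The same word stored in the two encodings decodes to the same text.
utf8_bytes = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = "caf\u00e9".encode("latin-1")  # b'caf\xe9'
print(decode_source(utf8_bytes))
print(decode_source(latin1_bytes))
```

The decision rests only on whether the bytes happen to form valid UTF-8, which is why it counts as a heuristic: nothing in the file declares the encoding.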

>
>  It could even be done on a per rest-of-the-file basis: read everything
>  as UTF-8, keeping track of whether a non-ASCII UTF-8 sequence has been
>  encountered. Upon seeing an invalid UTF-8 byte sequence,
>  if ($utf8_sequence_seen) { die } else { switch back to latin1 }
>
>  This would support all UTF-8 encoded source, and 99.9999999999% of all
>  latin1 encoded source. (Estimated, but I think it may be pretty
>  accurate.)
>
>
>  > Frankly im against using heuristics to determine encoding.
>
>
> Normally I am too. However, even though I looked for it, I have yet to
>  encounter any latin1 text that happens to also be interpretable as
>  valid UTF-8. The likeliness of this ever happening decreases as the
>  source size increases.

That's true. So it's a robust heuristic. Maybe you have a point there.
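
For illustration, the "rest of the file" idea quoted above could look roughly like this in Python. This is a sketch under assumptions: it works per line rather than per byte sequence, the name decode_lines is hypothetical, and raising ValueError stands in for the "die" in the quoted pseudocode.

```python
def decode_lines(raw_lines):
    """Decode an iterable of byte lines, assuming UTF-8 until proven otherwise.

    Returns (decoded_lines, encoding_used). Hypothetical helper, not
    perl's actual behaviour.
    """
    utf8_sequence_seen = False  # a valid non-ASCII UTF-8 line has been read
    latin1_mode = False         # we gave up on UTF-8 for the rest of the file
    decoded = []
    for lineno, raw in enumerate(raw_lines, start=1):
        if latin1_mode:
            decoded.append(raw.decode("latin-1"))
            continue
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            if utf8_sequence_seen:
                # Mixed file: valid UTF-8 earlier, invalid bytes now.
                raise ValueError("line %d: invalid UTF-8 after valid UTF-8" % lineno)
            # Everything so far was plain ASCII, so switching is lossless.
            latin1_mode = True
            text = raw.decode("latin-1")
        else:
            if not raw.isascii():
                utf8_sequence_seen = True
        decoded.append(text)
    return decoded, ("latin-1" if latin1_mode else "utf-8")
```

Because only ASCII has been seen before the switch, the earlier lines decode identically under both encodings, so nothing already read has to be reinterpreted.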

>  It'd be bad to detect *arbitrary* encodings based on heuristics, like
>  Internet Explorer does. But determining the difference between UTF-8 and
>  latin1 is actually rather safe.
>
>
>  > Better to just tell people to use editors that ensure the correct BOM
>  > is prepended to the file.
>
>
> Not all editors even support this, and it's far from common practice in
>  the *nix world, whether we like that or not.

Yes, the Unix world is sadly lacking in Unicode support at virtually
every level. They got a kludge that meant they could have Unicode
without any hassle and left it at that, warts and all.

>  Note that looking for a UTF-8 BOM would still be a heuristic. After
>  all, there's no way to know for sure that those 3 bytes weren't meant as
>  the three latin1 characters ï»¿ (iuml, raquo, iquest). But it is pretty
>  far fetched for any multibyte UTF-8 sequence to be considered as latin1.

It's not really a heuristic, though. It's straight-out documented behaviour.
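
To make the three bytes under discussion concrete, here is a small Python sketch (illustrative only; split_utf8_bom is a hypothetical name, and this says nothing about how perl's own BOM handling is implemented). It uses the standard codecs.BOM_UTF8 constant:

```python
import codecs


def split_utf8_bom(raw):
    """Return (had_bom, payload) for a byte string.

    Hypothetical helper: a leading EF BB BF is treated as a UTF-8 BOM
    and stripped; anything else is returned unchanged.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return True, raw[len(codecs.BOM_UTF8):]
    return False, raw


# The BOM is the bytes EF BB BF; read as Latin-1 they are the
# characters i-diaeresis, right guillemet, inverted question mark.
print(codecs.BOM_UTF8.hex())
print(codecs.BOM_UTF8.decode("latin-1"))
print(split_utf8_bom(codecs.BOM_UTF8 + b"print 1;\n"))
```

The check is a fixed prefix comparison, with no inspection of the rest of the file.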

Yves


-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
