develooper Front page | perl.perl5.porters | Postings from December 2010

Re: refined :utf8 I/O layers proposal

From: Eric Brine
Date: December 25, 2010 14:07
Subject: Re: refined :utf8 I/O layers proposal
Message ID: AANLkTinMLSxPN9+nY_CVS_JeNYvmhih_C8cd2bxW7LPo@mail.gmail.com

On Sat, Dec 25, 2010 at 9:17 AM, David Golden <xdaveg@gmail.com> wrote:

> On  Karl's proposal, I agree with Zefram that it's headed in the right
> direction.  Let's say that we call perl's internal encoding
> "encoding(int72)" for the sake of argument below.  Then we have two
> encodings:
>
>  :encoding(UTF-8)
>  :encoding(int72)
>

There are really three: UTF-8 for interchange, UTF-8 for intrachange, and
int72.


> A question regarding "safety" -- I believe one of the big safety
> issues is that UTF-8 must always encode/decode to the shortest
> possible sequence. [...] Would we want :encoding(int72_raw)
> as a means of allowing non-shortest sequences?
>

I don't see any benefit, and there are lots of downsides. For example, "eq"
doesn't recognize different encodings of the same character. At the very
least, we should officially not support longer-than-minimal encodings. But I
don't think that's enough, especially given how easy overly long encodings
are to detect. They always begin with one of the following lead bytes:

0b1100000x (always overly long)
0b11100000 (overly long if the next byte is below 0b10100000)
0b11110000 (overly long if the next byte is below 0b10010000)
etc
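
To make the detection concrete, here is a minimal sketch in Python (not perl's actual implementation). The function name overlong_at and the threshold table are illustrative assumptions. The 5- and 6-byte rows stand in for the "etc" and follow the original, pre-Unicode-restricted UTF-8 layout; perl's even longer extended forms are not covered. Note that for every lead byte past the two-byte form, the first continuation byte has to be inspected as well, since E0 A0 80 is the shortest encoding of U+0800.

```python
# Hypothetical helper: lead byte -> smallest first continuation byte that
# still yields a shortest-form sequence. Anything below it is overlong.
MIN_SECOND_BYTE = {
    0xE0: 0xA0,  # 3-byte form must reach U+0800
    0xF0: 0x90,  # 4-byte form must reach U+10000
    0xF8: 0x88,  # 5-byte form (assumed extended layout) must reach 0x200000
    0xFC: 0x84,  # 6-byte form (assumed extended layout) must reach 0x4000000
}


def overlong_at(buf: bytes, i: int) -> bool:
    """Return True if the sequence starting at buf[i] is an overlong encoding."""
    lead = buf[i]
    # 0xC0 and 0xC1 can only encode code points below U+0080.
    if lead in (0xC0, 0xC1):
        return True
    threshold = MIN_SECOND_BYTE.get(lead)
    if threshold is None or i + 1 >= len(buf):
        return False
    second = buf[i + 1]
    # Only a genuine continuation byte (10xxxxxx) can make the form overlong.
    return (second & 0xC0) == 0x80 and second < threshold


# The "eq" problem: both byte strings denote U+002F under a lax decoder,
# yet a bytewise comparison treats them as different.
slash = b"/"
overlong_slash = b"\xC0\xAF"
lax_code_point = ((overlong_slash[0] & 0x1F) << 6) | (overlong_slash[1] & 0x3F)
```

Here lax_code_point equals ord("/") while slash and overlong_slash differ as byte strings, which is the comparison hazard described above.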

As such, I recommend we do something about them. Options:

   - warn and let it through (yuck!)
   - warn and substitute in U+FFFD
   - warn and recode
   - recode (no warning)
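
The four options can be sketched as a single policy switch. This is only an illustration in Python under stated assumptions: the names lax_decode and apply_policy and the policy strings are hypothetical, and the input is assumed to be exactly one well-formed-length sequence already known to be overlong.

```python
import warnings


def lax_decode(seq: bytes) -> int:
    """Decode one UTF-8-style sequence without rejecting overlong forms."""
    lead = seq[0]
    if lead < 0x80:
        return lead
    # An n-byte sequence keeps (7 - n) payload bits in its lead byte.
    cp = lead & (0xFF >> (len(seq) + 1))
    for b in seq[1:]:
        cp = (cp << 6) | (b & 0x3F)
    return cp


def apply_policy(seq: bytes, policy: str) -> bytes:
    """Handle one overlong sequence according to the chosen option."""
    if policy not in ("pass", "substitute", "warn_recode", "recode"):
        raise ValueError(f"unknown policy: {policy}")
    if policy != "recode":
        warnings.warn(f"overlong UTF-8 sequence: {seq.hex()}")
    if policy == "pass":
        return seq  # warn and let it through
    if policy == "substitute":
        return "\uFFFD".encode("utf-8")  # warn and substitute U+FFFD
    # Both recode options re-emit the shortest form; "surrogatepass" keeps
    # this sketch from raising if the decoded value is a surrogate.
    return chr(lax_decode(seq)).encode("utf-8", "surrogatepass")
```

For example, apply_policy(b"\xC0\xAF", "recode") returns b"/" silently, while the "substitute" policy returns the three bytes EF BF BD and emits a warning.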

- Eric
