On Sat, Dec 25, 2010 at 9:17 AM, David Golden <xdaveg@gmail.com> wrote:
> On Karl's proposal, I agree with Zefram that it's headed in the right
> direction. Let's say that we call perl's internal encoding
> "encoding(int72)" for the sake of argument below. Then we have two
> encodings:
>
> :encoding(UTF-8)
> :encoding(int72)

There are really three: UTF-8 for interchange, UTF-8 for intrachange, and int72.

> A question regarding "safety" -- I believe one of the big safety
> issues is that UTF-8 must always encode/decode to the shortest
> possible sequence. [...] Would we want :encoding(int72_raw)
> as a means of allowing non-shortest sequences?

I don't see any benefit, and there are lots of downsides. For example, "eq" doesn't recognize different encodings of the same character.

At the very least, we should officially not support longer-than-minimal encodings. But I don't think that's enough, especially given how easy overly long encodings are to detect. Any instance of the following bytes indicates an overly long encoding:

0b11000000
0b11100000
0b11110000
etc

As such, I recommend we do something about them. Options:

- warn and let it through (yuck!)
- warn and substitute in U+FFFD
- warn and recode
- recode (no warning)

- Eric
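The detection described above can be sketched roughly as follows. This is an illustrative sketch in Python, not perl's actual implementation; `find_overlong` and `MIN_FIRST_CONTINUATION` are hypothetical names invented for this example. It assumes one refinement to the byte list: 0xC0 (and 0xC1) are conclusive on their own, whereas 0b11100000, 0b11110000 and the longer lead bytes start an overlong sequence only when the first continuation byte also contributes no significant bits. For instance, E0 A0 80 is the shortest form of U+0800, while E0 80 AF is an overlong "/". The 5- and 6-byte rows cover the extended scheme that the "etc" seems to refer to; standard UTF-8 stops at 4 bytes.

```python
# Lead bytes whose own payload bits are all zero, mapped to the smallest
# first continuation byte that still gives a shortest-form sequence.
MIN_FIRST_CONTINUATION = {
    0xE0: 0xA0,  # 3-byte form must encode at least U+0800
    0xF0: 0x90,  # 4-byte form must encode at least U+10000
    0xF8: 0x88,  # 5-byte form (extended scheme): at least 0x200000
    0xFC: 0x84,  # 6-byte form (extended scheme): at least 0x4000000
}


def find_overlong(data: bytes) -> list:
    """Return the offsets of lead bytes that begin an overlong sequence."""
    offsets = []
    for i, lead in enumerate(data):
        if lead in (0xC0, 0xC1):
            # A 2-byte form of a code point below U+0080 is always overlong.
            offsets.append(i)
            continue
        floor = MIN_FIRST_CONTINUATION.get(lead)
        if floor is None or i + 1 >= len(data):
            continue
        nxt = data[i + 1]
        # Overlong when the next byte is a continuation byte below the floor.
        if 0x80 <= nxt < floor:
            offsets.append(i)
    return offsets
```

Because continuation bytes (0x80 to 0xBF) can never equal any of the lead bytes being tested, a plain byte-by-byte scan is enough here; no decoder state is needed. What to do with each hit (warn, substitute U+FFFD, or recode) is the policy question raised in the options above and is deliberately left out of the sketch.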