
Re: refined :utf8 I/O layers proposal

David Golden
December 25, 2010 06:17
On Fri, Dec 24, 2010 at 2:00 PM, Dr.Ruud wrote:
> On 2010-12-24 17:32, Zefram wrote:
>> Perl's extended UTF-8 is extended precisely in
>> using 0xfe and 0xff octets to extend the range up to 72 bits.
> Let's then get any utf reference out of the way, and call it
>  :int72
>  :int_packed
>  :pint


If what perl does internally isn't consistent with the formal
definition of UTF-8, then having a different name for encoding it is a
good idea.  Obviously, the API can't be changed now, but the encoding
names and some related documentation in, say, could be.

On Karl's proposal, I agree with Zefram that it's headed in the right
direction.  Let's say that we call perl's internal encoding
"encoding(int72)" for the sake of argument below.  Then we have two


To those, we can apply various post-decoding restrictions, such as:


That's good.  I think "no_above_unicode" goes away if we have
encoding(int72) as an alternative to encoding(UTF-8), right?
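
To make the restriction names concrete, here is a rough Python sketch of
the checks such filters might apply to code points after decoding. The
function names and the check() helper are hypothetical illustrations, not
Perl or PerlIO APIs; only the code point ranges come from the Unicode
standard.

```python
# Hypothetical predicates for post-decoding restrictions.
# Names are illustrative only; none of this is an existing Perl API.

MAX_UNICODE = 0x10FFFF  # highest code point defined by Unicode


def is_surrogate(cp):
    # UTF-16 surrogate range, U+D800..U+DFFF
    return 0xD800 <= cp <= 0xDFFF


def is_nonchar(cp):
    # U+FDD0..U+FDEF, plus the last two code points of each plane
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    return cp <= MAX_UNICODE and (cp & 0xFFFE) == 0xFFFE


def is_above_unicode(cp):
    return cp > MAX_UNICODE


def check(code_points, no_surrogates=False, no_nonchars=False,
          no_above_unicode=False):
    """Return the code points unchanged, or raise on the first violation."""
    for cp in code_points:
        if no_surrogates and is_surrogate(cp):
            raise ValueError("surrogate U+%04X" % cp)
        if no_nonchars and is_nonchar(cp):
            raise ValueError("noncharacter U+%04X" % cp)
        if no_above_unicode and is_above_unicode(cp):
            raise ValueError("above Unicode: 0x%X" % cp)
    return list(code_points)
```

If the strict encoding already refuses anything above U+10FFFF, the third
check only matters for the permissive encoding, which is the point of the
question above.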

A question regarding "safety" -- I believe one of the big safety
issues is that UTF-8 must always encode/decode to the shortest
possible sequence.  I think that should be true for encoding(UTF-8),
but what about for encoding(int72)? Would there ever be a reason to
allow it to be otherwise (such as working with and fixing known
invalid encoded data)? Would we want


as a means of allowing non-shortest sequences?  Alternatively, the
encoding layers could allow such non-shortest sequences and we could
have pre-filters like:

  :no_overlong -- throw an error (or warn and replace with the
                  replacement character)
  :shorten_overlong -- replace with shortest equivalent
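
For illustration, the two pre-filters could behave roughly like the
Python sketch below. It is an assumption-laden toy: it handles only 1- to
4-byte sequences (not the longer forms perl's extended encoding permits),
works on one sequence at a time, and every function name is made up for
this example.

```python
# Toy model of overlong detection and repair for 1..4-byte sequences.
# All names are hypothetical; this is not how PerlIO layers are written.

# Smallest code point that genuinely needs a sequence of each length.
MIN_FOR_LENGTH = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_one(buf):
    """Leniently decode the first sequence; return (code_point, length)."""
    lead = buf[0]
    if lead < 0x80:
        return lead, 1
    if lead & 0xE0 == 0xC0:
        length, cp = 2, lead & 0x1F
    elif lead & 0xF0 == 0xE0:
        length, cp = 3, lead & 0x0F
    elif lead & 0xF8 == 0xF0:
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("lead byte not handled by this sketch")
    if len(buf) < length:
        raise ValueError("truncated sequence")
    for b in buf[1:length]:
        if b & 0xC0 != 0x80:
            raise ValueError("bad continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp, length


def encode_shortest(cp):
    """Encode a code point (up to 21 bits) in the fewest bytes."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])


def no_overlong(buf):
    """Return the sequence unchanged, or raise if it is overlong."""
    cp, length = decode_one(buf)
    if cp < MIN_FOR_LENGTH[length]:
        raise ValueError("overlong encoding of U+%04X" % cp)
    return buf[:length]


def shorten_overlong(buf):
    """Return the shortest equivalent of the first sequence."""
    cp, _ = decode_one(buf)
    return encode_shortest(cp)
```

The classic trouble case is b"\xC0\xAF", a two-byte spelling of "/":
no_overlong() rejects it, while shorten_overlong() turns it into the
single byte b"/".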

Supposing that encoding(int72) did allow overlong characters, we could
restrict it like this:


With those building blocks, then we alias combinations along the lines
that Karl suggested. I might suggest:

  :utf8 == :strict_utf8 == :encoding(UTF-8):no_surrogates:no_nonchars
  :lax_utf8 == :no_overlong:encoding(int72)
  :unsafe_utf8 == :encoding(int72)
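
As a sketch of how such aliasing might work, the table above can be
treated as a simple name-to-stack lookup. This is Python purely for
illustration; the expand() helper is hypothetical, and it assumes layer
names never contain a colon.

```python
# Hypothetical alias table mirroring the suggestions above.
ALIASES = {
    "utf8": ["encoding(UTF-8)", "no_surrogates", "no_nonchars"],
    "strict_utf8": ["encoding(UTF-8)", "no_surrogates", "no_nonchars"],
    "lax_utf8": ["no_overlong", "encoding(int72)"],
    "unsafe_utf8": ["encoding(int72)"],
}


def expand(spec):
    """Expand a layer string such as ':lax_utf8' into its building blocks."""
    layers = []
    for name in spec.split(":"):
        if not name:
            continue  # skip the empty piece before the leading colon
        # Unknown names pass through unchanged.
        layers.extend(ALIASES.get(name, [name]))
    return layers
```

With this, expand(":lax_utf8") gives ["no_overlong", "encoding(int72)"],
and a name that is not an alias, such as ":raw", is left alone.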

I don't like calling anything ":safe_utf8" because people will use it
without knowing what it means because it sounds good.  ("Safe?  That
must be what I want!")  Calling something "unsafe" has the opposite
effect, which is a good affordance.

-- David
