
Re: refined :utf8 I/O layers proposal

From: David Golden
Date: December 25, 2010 06:17
Subject: Re: refined :utf8 I/O layers proposal
Message ID: AANLkTikx-vo9ifjtRXrGTsoDdyXX6-vNxD14AfvgMNSp@mail.gmail.com

On Fri, Dec 24, 2010 at 2:00 PM, Dr.Ruud <rvtol+usenet@isolution.nl> wrote:
> On 2010-12-24 17:32, Zefram wrote:
>
>> Perl's extended UTF-8 is extended precisely in
>> using 0xfe and 0xff octets to extend the range up to 72 bits.
>
> Let's then get any utf reference out of the way, and call it
>
>  :int72
>  :int_packed
>  :pint

+1

If what perl does internally isn't consistent with the formal
definition of UTF-8, then having a different name for encoding it is a
good idea.  Obviously, the API can't be changed now, but the encoding
names and some related documentation in, say, utf8.pm could be.

On Karl's proposal, I agree with Zefram that it's headed in the right
direction.  Let's say that we call perl's internal encoding
"encoding(int72)" for the sake of argument below.  Then we have two
encodings:

  :encoding(UTF-8)
  :encoding(int72)

To those, we can apply various post-decoding restrictions, such as:

  :no_surrogates
  :no_nonchars

That's good.  I think "no_above_unicode" goes away if we have
encoding(int72) as an alternative to encoding(UTF-8), right?
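
To make the two restrictions concrete, here is a minimal sketch of what
such post-decoding checks would test. It is written in Python purely as
an illustration and is not a PerlIO layer. The function names
(is_surrogate, is_nonchar, check_code_points) are hypothetical. The
ranges are the standard Unicode ones: surrogates are U+D800..U+DFFF, and
noncharacters are U+FDD0..U+FDEF plus the last two code points of every
plane.

```python
def is_surrogate(cp: int) -> bool:
    """True for UTF-16 surrogate code points, U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_nonchar(cp: int) -> bool:
    """True for Unicode noncharacters.

    These are U+FDD0..U+FDEF and U+nFFFE / U+nFFFF in each of the
    17 planes (n = 0x00 .. 0x10).
    """
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    return cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def check_code_points(code_points, no_surrogates=True, no_nonchars=True):
    """Hypothetical stand-in for :no_surrogates and :no_nonchars.

    Returns the code points as a list, or raises ValueError on the
    first one that violates an enabled restriction.
    """
    cps = list(code_points)
    for cp in cps:
        if no_surrogates and is_surrogate(cp):
            raise ValueError(f"surrogate U+{cp:04X}")
        if no_nonchars and is_nonchar(cp):
            raise ValueError(f"noncharacter U+{cp:04X}")
    return cps
```

Because both checks operate on decoded code points, they are independent
of which encoding produced them, which is why they can be stacked on
either encoding.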

A question regarding "safety" -- I believe one of the big safety
issues is that UTF-8 must always encode/decode to the shortest
possible sequence.  I think that should be true for encoding(UTF-8),
but what about for encoding(int72)? Would there ever be a reason to
allow it to be otherwise (such as working with and fixing known
invalid encoded data)? Would we want

  :encoding(int72_raw)

as a means of allowing non-shortest sequences?  Alternatively, the
encoding layers could allow such non-shortest sequences and we could
have pre-filters like:

  :no_overlong -- throw an error (or warn and substitute the
                  replacement character)
  :shorten_overlong -- replace with shortest equivalent

Supposing that encoding(int72) did allow overlong sequences, we could
restrict it like this:

  :no_overlong:encoding(int72)
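
For readers unfamiliar with overlong sequences, the following is a rough
sketch of the check a :no_overlong filter would perform and the rewrite a
:shorten_overlong filter would do. It is illustrative Python, not Perl,
and every function name in it is hypothetical. It handles only the
classic 1 to 6 byte form and does not cover Perl's extended sequences
that start with 0xfe or 0xff.

```python
# Exclusive upper bounds on the code points that fit in 1..6 bytes.
_LIMITS = (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000)


def decode_lax(seq: bytes) -> int:
    """Decode one UTF-8-style sequence without rejecting overlongs."""
    lead = seq[0]
    if lead < 0x80:
        length, cp = 1, lead
    else:
        # The number of leading 1 bits in the lead byte is the length.
        length = 0
        while lead & (0x80 >> length):
            length += 1
        if length < 2 or length > 6:
            raise ValueError("invalid lead byte")
        cp = lead & (0x7F >> length)
    if len(seq) != length:
        raise ValueError("wrong sequence length")
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp


def shortest_length(cp: int) -> int:
    """Number of bytes in the shortest encoding of cp."""
    for length, limit in enumerate(_LIMITS, start=1):
        if cp < limit:
            return length
    raise ValueError("code point too large for a 6-byte sequence")


def is_overlong(seq: bytes) -> bool:
    """What a :no_overlong filter would test for."""
    return len(seq) > shortest_length(decode_lax(seq))


def encode_shortest(cp: int) -> bytes:
    """Encode cp in its shortest form."""
    n = shortest_length(cp)
    if n == 1:
        return bytes([cp])
    out = []
    for _ in range(n - 1):
        out.append(0x80 | (cp & 0x3F))  # continuation bytes, low bits first
        cp >>= 6
    out.append(((0xFF00 >> n) & 0xFF) | cp)  # lead byte with n leading 1 bits
    return bytes(reversed(out))


def shorten_overlong(seq: bytes) -> bytes:
    """What a :shorten_overlong filter would do to one sequence."""
    return encode_shortest(decode_lax(seq))
```

The classic example is the two-byte sequence 0xC0 0xAF, which decodes to
U+002F ("/") even though that character fits in one byte. A filter that
compares bytes before decoding would miss it, which is the safety concern
with accepting non-shortest forms.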

With those building blocks, then we alias combinations along the lines
that Karl suggested. I might suggest:

  :utf8 == :strict_utf8 == :encoding(UTF-8):no_surrogates:no_nonchars
  :lax_utf8 == :no_overlong:encoding(int72)
  :unsafe_utf8 == :encoding(int72)

I don't like calling anything ":safe_utf8" because people will use it
without knowing what it means, simply because it sounds good.  ("Safe?
That must be what I want!")  Calling something "unsafe" has the opposite
effect, which is a good affordance.

-- David
