Re: refined :utf8 I/O layers proposal
From: David Golden
Date: December 25, 2010 06:17
Subject: Re: refined :utf8 I/O layers proposal
Message ID: AANLkTikx-vo9ifjtRXrGTsoDdyXX6-vNxD14AfvgMNSp@mail.gmail.com
On Fri, Dec 24, 2010 at 2:00 PM, Dr.Ruud <rvtol+usenet@isolution.nl> wrote:
> On 2010-12-24 17:32, Zefram wrote:
>
>> Perl's extended UTF-8 is extended precisely in
>> using 0xfe and 0xff octets to extend the range up to 72 bits.
>
> Let's then get any utf reference out of the way, and call it
>
> :int72
> :int_packed
> :pint
+1
If what perl does internally isn't consistent with the formal
definition of UTF-8, then having a different name for encoding it is a
good idea. Obviously, the API can't be changed now, but the encoding
names and some related documentation in, say, utf8.pm could be.
On Karl's proposal, I agree with Zefram that it's headed in the right
direction. Let's say that we call perl's internal encoding
"encoding(int72)" for the sake of argument below. Then we have two
encodings:
:encoding(UTF-8)
:encoding(int72)
To those, we can apply various post-decoding restrictions, such as:
:no_surrogates
:no_nonchars
That's good. I think "no_above_unicode" goes away if we have
encoding(int72) as an alternative to encoding(UTF-8), right?
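To make the idea of post-decoding restrictions concrete, here is a minimal sketch in plain Perl. It assumes the string has already been decoded, and the helper name restriction_violations is hypothetical; the option names simply mirror the layer names proposed above and are not an existing PerlIO or Encode API.

```perl
use strict;
use warnings;

# Hypothetical helper (not an existing API): scan an already-decoded string
# and report code points that the proposed restrictions would reject.
sub restriction_violations {
    my ($str, %opt) = @_;
    my @problems;
    for my $i (0 .. length($str) - 1) {
        my $cp = ord(substr($str, $i, 1));

        # Surrogates occupy U+D800..U+DFFF.
        if ($opt{no_surrogates} && $cp >= 0xD800 && $cp <= 0xDFFF) {
            push @problems, sprintf("surrogate U+%04X at offset %d", $cp, $i);
        }

        # Noncharacters are U+FDD0..U+FDEF plus the last two code points
        # of every plane (U+xxFFFE and U+xxFFFF) within the Unicode range.
        if ($opt{no_nonchars} && $cp <= 0x10FFFF
            && (($cp >= 0xFDD0 && $cp <= 0xFDEF) || ($cp & 0xFFFE) == 0xFFFE)) {
            push @problems, sprintf("noncharacter U+%04X at offset %d", $cp, $i);
        }
    }
    return @problems;
}
```

A real layer would presumably raise an error or substitute a replacement character instead of returning a list; that policy is left open here.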
A question regarding "safety" -- I believe one of the big safety
issues is that UTF-8 must always encode/decode to the shortest
possible sequence. I think that should be true for encoding(UTF-8),
but what about for encoding(int72)? Would there ever be a reason to
allow it to be otherwise (such as working with and fixing known
invalid encoded data)? Would we want
:encoding(int72_raw)
as a means of allowing non-shortest sequences? Alternatively, the
encoding layers could allow such non-shortest sequences and we could
have pre-filters like:
:no_overlong -- throw an error (or warn and replace with
replacement character)
:shorten_overlong -- replace with shortest equivalent
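As a rough illustration of what a :no_overlong filter would have to detect, here is a sketch that looks for overlong forms in raw bytes. The function name has_overlong is hypothetical, and the pattern only covers standard UTF-8 sequences of up to four bytes; the longer sequences of perl's extended encoding would need further cases.

```perl
use strict;
use warnings;

# Hypothetical check (not an existing API): returns 1 if the byte string
# contains an overlong form among 2-, 3- or 4-byte UTF-8 sequences.
sub has_overlong {
    my ($bytes) = @_;
    return $bytes =~ /
          [\xC0\xC1] [\x80-\xBF]            # 2 bytes used for U+0000..U+007F
        | \xE0 [\x80-\x9F] [\x80-\xBF]      # 3 bytes used for U+0000..U+07FF
        | \xF0 [\x80-\x8F] [\x80-\xBF]{2}   # 4 bytes used for U+0000..U+FFFF
    /x ? 1 : 0;
}
```

For example, "/" (U+002F) has the single-byte form 0x2F, but the bytes 0xC0 0xAF carry the same code point in two bytes; has_overlong("\xC0\xAF") reports that, while has_overlong("/") does not.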
Supposing that encoding(int72) did allow overlong characters, we could
restrict it like this:
:no_overlong:encoding(int72)
With those building blocks, then we alias combinations along the lines
that Karl suggested. I might suggest:
:utf8 == :strict_utf8 == :encoding(UTF-8):no_surrogates:no_nonchars
:lax_utf8 == :no_overlong:encoding(int72)
:unsafe_utf8 == :encoding(int72)
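The aliases above amount to a small lookup table. The sketch below only models that expansion; none of these aliases, nor the int72 encoding, exist in perl as written here, and expand_layer_alias is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical model of the proposed aliases: resolve one alias name to the
# layer stack it stands for, following chained aliases such as :utf8.
sub expand_layer_alias {
    my ($name) = @_;
    my %alias = (
        ':strict_utf8' => ':encoding(UTF-8):no_surrogates:no_nonchars',
        ':utf8'        => ':strict_utf8',
        ':lax_utf8'    => ':no_overlong:encoding(int72)',
        ':unsafe_utf8' => ':encoding(int72)',
    );
    my %seen;
    while (exists $alias{$name}) {
        die "alias loop at $name\n" if $seen{$name}++;   # guard against cycles
        $name = $alias{$name};
    }
    return $name;   # names that are not aliases are returned unchanged
}
```

With this table, expand_layer_alias(':utf8') resolves through :strict_utf8 to ':encoding(UTF-8):no_surrogates:no_nonchars'.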
I don't like calling anything ":safe_utf8" because people will use it
without knowing what it means because it sounds good. ("Safe? That
must be what I want!") Calling something "unsafe" has the opposite
effect, which is a good affordance.
-- David