
Re: refined :utf8 I/O layers proposal

karl williamson
December 24, 2010 09:19
Zefram wrote:
> karl williamson wrote:
>> I believe this gives the orthogonality that xdg wants;
> It's getting closer.  It would help if you described the various :foo_utf8
> layers in terms of equivalent pairs of encoding and strictness layers.
> I'd like to see a strict distinction between standard UTF-8 and Perl's
> internal extended UTF-8.  This is not a matter for the strictness axis,
> it's better treated as an encoding matter.  Your discussion for :safe_utf8
> suggests that you're not entirely clear about it.  Standard UTF-8 can
> represent any codepoint up to 31 bits, and never uses 0xfe or 0xff octets
> in the encoded form.  Perl's extended UTF-8 is extended precisely in
> using 0xfe and 0xff octets to extend the range up to 72 bits.  If I ask
> for standard UTF-8 decoding, any 0xfe or 0xff on the input must be an
> error, no matter how permissive I am about which characters I'll accept.

I think we are using the term "standard UTF-8" differently.  I'm using
it according to the Unicode standard's definition.  Under that
definition, UTF-8 does not include anything above code point 0x10FFFF,
nor surrogates.  The non-characters are also not allowed in UTF-8 in
open interchange.  So by the Unicode definition, Perl's extended UTF-8
departs from the standard in more ways than just going beyond 31 bits.
I am of the opinion that we should use the standard's definition of
standard UTF-8 in our documentation, as that is what the rest of the
world will think we mean.
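
To make the three categories above concrete, here is a minimal sketch of
how a code point could be checked against the Unicode definition.  It is
written in Python purely for illustration; the function name
classify_code_point is hypothetical and is not part of Perl, Encode, or
any of the proposed layers.

```python
def classify_code_point(cp: int) -> str:
    """Classify an integer code point under the Unicode standard's rules.

    Hypothetical helper for illustration only; not an API of Perl or of
    any proposed I/O layer.
    """
    # Unicode code points run from U+0000 to U+10FFFF inclusive.
    if cp < 0 or cp > 0x10FFFF:
        return "out of range"
    # Surrogates (U+D800..U+DFFF) are excluded from UTF-8.
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # Non-characters: U+FDD0..U+FDEF, plus the last two code points of
    # every plane (U+xxFFFE and U+xxFFFF).
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    return "ok"
```

Under this reading, a code point such as 0x110000 is already outside
standard UTF-8, long before the 31-bit limit is reached.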

I think that we need to resolve what we mean by standard UTF-8 before 
deciding further on the various layers.
