perl.perl5.porters

Re: refined :utf8 I/O layers proposal

December 24, 2010 08:32
karl williamson wrote:
>I believe this gives the orthogonality that xdg wants;

It's getting closer.  It would help if you described the various :foo_utf8
layers in terms of equivalent pairs of encoding and strictness layers.

I'd like to see a strict distinction between standard UTF-8 and Perl's
internal extended UTF-8.  This is not a matter for the strictness axis;
it's better treated as an encoding matter.  Your discussion for :safe_utf8
suggests that you're not entirely clear about it.  Standard UTF-8 can
represent any codepoint up to 31 bits, and never uses 0xfe or 0xff octets
in the encoded form.  Perl's extended UTF-8 is extended precisely in
using 0xfe and 0xff octets to extend the range up to 72 bits.  If I ask
for standard UTF-8 decoding, any 0xfe or 0xff on the input must be an
error, no matter how permissive I am about which characters I'll accept.
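To make the point concrete, here is a minimal sketch in Python (chosen only
for illustration; it is not Perl's implementation, and the function name
decode_utf8_31bit is hypothetical).  It decodes the 1- to 6-octet form that
reaches 31 bits and applies no character-level strictures at all, yet it
still has to reject 0xfe and 0xff, because those octets are simply not part
of the encoding:

```python
# Smallest code point that legitimately needs each sequence length;
# anything smaller is an overlong encoding.
MIN_FOR_LENGTH = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000,
                  5: 0x200000, 6: 0x4000000}


def decode_utf8_31bit(data: bytes) -> list:
    """Decode 1- to 6-octet UTF-8 (code points up to 2**31 - 1).

    Deliberately permissive about characters: surrogates, non-characters
    and code points above 0x10FFFF are returned unchanged.
    """
    code_points = []
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            length, cp = 1, lead
        elif 0xC0 <= lead < 0xE0:
            length, cp = 2, lead & 0x1F
        elif 0xE0 <= lead < 0xF0:
            length, cp = 3, lead & 0x0F
        elif 0xF0 <= lead < 0xF8:
            length, cp = 4, lead & 0x07
        elif 0xF8 <= lead < 0xFC:
            length, cp = 5, lead & 0x03
        elif 0xFC <= lead < 0xFE:
            length, cp = 6, lead & 0x01
        else:
            # 0x80..0xBF is a stray continuation octet; 0xFE and 0xFF
            # never appear in this encoding, whatever the strictness.
            raise ValueError("invalid lead octet 0x%02x at offset %d" % (lead, i))
        if i + length > n:
            raise ValueError("truncated sequence at offset %d" % i)
        for b in data[i + 1:i + length]:
            if b & 0xC0 != 0x80:
                raise ValueError("bad continuation octet at offset %d" % i)
            cp = (cp << 6) | (b & 0x3F)
        if cp < MIN_FOR_LENGTH[length]:
            raise ValueError("overlong encoding at offset %d" % i)
        code_points.append(cp)
        i += length
    return code_points
```

A surrogate such as the octets ED A0 80 decodes without complaint here,
since rejecting it would be the job of a separate stricture, whereas a lone
FE or FF octet raises ValueError in every configuration.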

So, suppose we have :encoding(UTF-8) for standard UTF-8, and
:encoding(utf8) for Perl's extended UTF-8.  (I'd really like to deprecate
the latter name, if possible.)  And on the strictness axis suppose we
have :no_surrogates, :no_above_unicode, and :no_nonchars, as you describe.
I think your :foo_utf8 layers then are defined thus:

    :strict_utf8  ==  :encoding(UTF-8)
                      :no_surrogates :no_above_unicode :no_nonchars
    :safe_utf8    ==  :encoding(UTF-8) :no_surrogates :no_nonchars
    :unsafe_utf8  ==  :encoding(utf8)
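
The same table can be sketched as data, again in Python purely for
illustration.  The layer names are the ones proposed in this thread, not
existing Perl layers, and the helper names (STRICTURES, NAMED_LAYERS,
violations) are hypothetical.  The point is that each stricture is an
independent predicate on code points, and a named layer is just one
encoding plus a set of strictures:

```python
def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF


def is_above_unicode(cp: int) -> bool:
    return cp > 0x10FFFF


def is_nonchar(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus the last two code points of each plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE)


# Stricture name -> predicate that is true for a code point it forbids.
STRICTURES = {
    "no_surrogates": is_surrogate,
    "no_above_unicode": is_above_unicode,
    "no_nonchars": is_nonchar,
}

# Named layer -> (encoding, set of strictures), following the table above.
NAMED_LAYERS = {
    "strict_utf8": ("UTF-8", {"no_surrogates", "no_above_unicode", "no_nonchars"}),
    "safe_utf8": ("UTF-8", {"no_surrogates", "no_nonchars"}),
    "unsafe_utf8": ("utf8", set()),
}


def violations(code_points, strictures):
    """Return (code point, stricture name) pairs for every forbidden code point."""
    found = []
    for cp in code_points:
        for name in sorted(strictures):
            if STRICTURES[name](cp):
                found.append((cp, name))
    return found
```

For example, violations([0x41, 0xD800, 0x110000], NAMED_LAYERS["safe_utf8"][1])
reports only the surrogate, because :safe_utf8 as defined above carries no
:no_above_unicode stricture.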

Obviously, by taking advantage of the orthogonality there are many
other :utf8-like layer combinations that could be named.  I don't
have very strong opinions about which ones ought to have short names,
other than that the easiest to use, :utf8, ought to be quite strict.
I don't think :encoding(utf8) with no strictures (your :unsafe_utf8)
deserves a short name.

The combination of the three stricture layers ought to have a short name,
possibly ":strict_unicode".

