
:utf8 status

Leon Timmermans
September 7, 2012 09:47
Hi Porters,

As some of you may know, I've been working on a new :utf8 layer
together with Christian Hansen that is supposed to fix a number of
issues with the current one. In particular, :utf8 currently doesn't do
any form of validation; it is only a flag that tells perl it should
assume the bytestream is actually validly encoded utf-8.
This has a number of important (security) implications that have been
warned against for almost 5 years now.

The smoke-me/leont/safe-utf8 branch replaces it with an actual layer
that checks the bytestream for utf-8 validity. By default it does
strict checking: it mandates not only well-formedness but also forbids
surrogates, non-characters and out-of-plane codepoints. That said, it
can be made somewhat more permissive on the latter points (some people
want that). Unlike :encoding(utf-8) it's written in pure-C and as such
is both clone-safe and fast. It's currently also living on CPAN as
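
To make the strict checks above concrete, here is a minimal sketch in C. It is not the code from the branch: the function name utf8_strict_valid and the two UTF8_ALLOW_* flags are hypothetical names chosen for illustration. It rejects ill-formed sequences (stray continuation bytes, truncation, overlong forms) and, by default, surrogates, non-characters and codepoints above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flags for the "more permissive" modes; not real perl names. */
#define UTF8_ALLOW_SURROGATES 0x1u
#define UTF8_ALLOW_NONCHARS   0x2u

/* Returns 1 if buf[0..len) is acceptable UTF-8 under the given flags, else 0. */
int utf8_strict_valid(const unsigned char *buf, size_t len, unsigned allow)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = buf[i];
        uint32_t cp;    /* decoded codepoint */
        uint32_t min;   /* smallest codepoint allowed for this length */
        size_t need;    /* number of continuation bytes expected */
        size_t k;

        if (b < 0x80)                { cp = b;        need = 0; min = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; need = 1; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; need = 2; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; need = 3; min = 0x10000; }
        else return 0;               /* stray continuation byte or 0xF8..0xFF */

        if (len - i - 1 < need)
            return 0;                /* sequence truncated at end of buffer */

        for (k = 1; k <= need; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80)
                return 0;            /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min)
            return 0;                /* overlong encoding */
        if (cp > 0x10FFFF)
            return 0;                /* beyond the Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF && !(allow & UTF8_ALLOW_SURROGATES))
            return 0;                /* surrogate */
        if (((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE)
            && !(allow & UTF8_ALLOW_NONCHARS))
            return 0;                /* non-character */

        i += need + 1;
    }
    return 1;
}
```

Passing 0 for allow gives the strict default; a real layer would also have to cope with sequences split across buffer boundaries, which this sketch ignores.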

There are two bugs that are blocking it though. The first is #111542,
«":bytes" is broken too». binmode ":bytes" currently disables the UTF8
flag, without taking into account whether this is a logical thing or not.
It currently doesn't DWIM on any additional layer such as :encoding,
and my patches would extend it to ":utf8".  I think the only sensible
fix is to change it to pop off any layers which would logically only
provide characters, not bytes. This patch is supplied too in my
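
As a rough illustration of that popping rule, here is a toy model in C. It does not use the real PerlIO API: struct toy_layer, its yields_characters field and toy_apply_bytes are all made-up names, and the model assumes each layer can simply be tagged as producing characters or bytes.

```c
#include <stddef.h>

/* Toy model only: these types are illustrative, not the PerlIO API. */
struct toy_layer {
    const char *name;
    int yields_characters;  /* 1 if the layer turns bytes into characters */
};

/* Model of binmode(":bytes"): pop layers off the top of the stack for as
 * long as the top layer yields characters. layers[depth - 1] is the top.
 * Returns the new stack depth. */
size_t toy_apply_bytes(const struct toy_layer *layers, size_t depth)
{
    while (depth > 0 && layers[depth - 1].yields_characters)
        depth--;
    return depth;
}
```

Under this model a stack of unix, perlio, encoding(...) drops back to unix, perlio, and a stack that already yields bytes is left untouched.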

More importantly, there's bug #113424: «:stdio + any other layer
hangs». This means that the new :utf8 layer doesn't work correctly on
top of :stdio on "slow devices" (e.g. ttys, pipes & sockets), like
:encoding already doesn't (though no one seems to have noticed that in
the past 5 years, so I'm not sure it's really all that vital). The
only way to fix this that I can see would involve refactoring much
of our readline implementation into PerlIO. That'd be a good idea
anyway (really, why doesn't PerlIO have a readline?), but this may be
a significant amount of work. At some level I'd like to just drop
":stdio", but there's still code out there that depends on it in
horribly buggy ways (see #114608). At some level doing both may be a
good idea.

As usual, this whole thing turns out to be much more complicated than
it should have been :-/

