
Re: RFC: Because of security, deprecate :utf8 layer in favor of :utf8_no_check

Nicholas Clark
January 3, 2011 09:22
On Fri, Dec 17, 2010 at 03:00:50PM -0700, karl williamson wrote:
> Nicholas Clark wrote:
> >On Wed, Dec 15, 2010 at 01:09:41PM -0700, karl williamson wrote:
> >>Many people use :utf8 without realizing the security implications.
> >>
> >>I claim that they need to be warned about that.  Deprecating the name, 
> >>and changing it to one like :utf8_no_check (better name suggestions 
> >>welcome) would accomplish this.
> >
> >I would strongly prefer to change :utf8 to check for structural errors, by
> >which I mean:
> >
> >1: unexpected continuation bytes
> >2: missing continuation bytes
> >3: overlong sequences
> >4: UTF-16 surrogates
> >
> >ie non-characters and beyond Unicode code points are not errors. The above
> >would become I/O errors.
> >
> >I think I can see how to do this, with probably little performance impact.
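
As a rough illustration of the four structural checks listed above, here is a
minimal C sketch. It is not the prototype attached to this message: every name
in it (u8_status, check_one, check_utf8) is hypothetical. It assumes sequences
of up to 6 bytes as in the original UTF-8 spec, and it deliberately does not
reject non-characters or beyond-Unicode code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes; these names are illustrative, not Perl's. */
typedef enum {
    U8_OK = 0,
    U8_UNEXPECTED_CONTINUATION, /* 1: 10xxxxxx where a start byte is due */
    U8_MISSING_CONTINUATION,    /* 2: sequence ends or breaks off early  */
    U8_OVERLONG,                /* 3: longer encoding than necessary     */
    U8_SURROGATE,               /* 4: U+D800..U+DFFF                     */
    U8_BAD_START                /* 0xFE or 0xFF as a start byte          */
} u8_status;

/* Check one sequence at s (len >= 1). *used receives the bytes examined. */
static u8_status check_one(const unsigned char *s, size_t len, size_t *used)
{
    /* Smallest code point that genuinely needs a sequence of each length. */
    static const uint32_t min_cp[7] =
        { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    unsigned char b = s[0];
    size_t need, i;
    uint32_t cp;

    *used = 1;
    if (b < 0x80) return U8_OK;
    if (b < 0xC0) return U8_UNEXPECTED_CONTINUATION;
    if      (b < 0xE0) { need = 2; cp = b & 0x1F; }
    else if (b < 0xF0) { need = 3; cp = b & 0x0F; }
    else if (b < 0xF8) { need = 4; cp = b & 0x07; }
    else if (b < 0xFC) { need = 5; cp = b & 0x03; }
    else if (b < 0xFE) { need = 6; cp = b & 0x01; }
    else return U8_BAD_START;

    for (i = 1; i < need; i++) {
        if (i >= len || (s[i] & 0xC0) != 0x80) {
            *used = i;
            return U8_MISSING_CONTINUATION;
        }
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    *used = need;
    if (cp < min_cp[need]) return U8_OVERLONG;
    if (cp >= 0xD800 && cp <= 0xDFFF) return U8_SURROGATE;
    return U8_OK; /* non-characters and beyond-Unicode values pass */
}

/* Return the first structural error in the buffer, or U8_OK. */
static u8_status check_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0, used;
    while (i < len) {
        u8_status st = check_one(s + i, len - i, &used);
        if (st != U8_OK) return st;
        i += used;
    }
    return U8_OK;
}
```

In an I/O layer the non-OK results would presumably be mapped to a read error;
how that would be reported is not settled in this thread.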

> Do you then propose a new something like "unchecked_utf8" for those who 
> wish to bypass this? Or do we never allow malformed input?

I don't know.
We can't stop someone writing the "unchecked" code and putting it on CPAN.
Then again, we don't have to ship the "unchecked" code simply because if it's
an itch that needs scratching [with a loaded shotgun, obviously :-)] it
*can* be done by CPAN.

In which case, the only "user" option is really whether to consider surrogates
as errors, or as acceptable.

> The only problem I have with this proposal is that I believe that the 
> default should include checking for non-character code points and above 
> Unicode code points, but there does need to be a way to turn this 
> portion of checking off?

I could see how to easily turn checking of the above 4 on and off
independently, using the same code but changing the driver table for the
decoding (see attached code), and likewise for "above Unicode" and
"above UTF-8" (as per the original spec, to 0x7FFFFFFF, IIRC)
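
To make the "same code, different driver table" idea concrete, here is a small
C sketch. It is an assumption about how such a table could look, not the
attached code: the flag names, u8_entry, build_table and is_valid are all
hypothetical. Each start byte maps to a sequence length plus a permitted range
for the second byte, which is enough to exclude overlongs, surrogates and
out-of-range values. Only two switches are modelled here, and values above
0x7FFFFFFF are not handled at all.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical option flags; names are illustrative only. */
enum {
    ALLOW_SURROGATES    = 1 << 0, /* accept U+D800..U+DFFF       */
    ALLOW_ABOVE_UNICODE = 1 << 1  /* accept 0x110000..0x7FFFFFFF */
};

/* One entry per possible first byte: total length (0 = reject) and the
 * permitted range for the second byte of the sequence. */
typedef struct { unsigned char len, lo, hi; } u8_entry;

static void set_range(u8_entry *t, int from, int to, unsigned char len)
{
    for (int b = from; b <= to; b++) {
        t[b].len = len;
        t[b].lo = 0x80;
        t[b].hi = 0xBF;
    }
}

/* Build the driver table for a given set of flags. */
static void build_table(u8_entry t[256], unsigned flags)
{
    memset(t, 0, 256 * sizeof t[0]);   /* default: reject the byte      */
    set_range(t, 0x00, 0x7F, 1);
    set_range(t, 0xC2, 0xDF, 2);       /* 0xC0, 0xC1 stay rejected      */
    set_range(t, 0xE0, 0xEF, 3);
    t[0xE0].lo = 0xA0;                 /* lower second byte is overlong */
    if (!(flags & ALLOW_SURROGATES))
        t[0xED].hi = 0x9F;             /* 0xED 0xA0.. are surrogates    */
    set_range(t, 0xF0, 0xF4, 4);
    t[0xF0].lo = 0x90;                 /* lower second byte is overlong */
    t[0xF4].hi = 0x8F;                 /* stop at U+10FFFF              */
    if (flags & ALLOW_ABOVE_UNICODE) {
        t[0xF4].hi = 0xBF;
        set_range(t, 0xF5, 0xF7, 4);
        set_range(t, 0xF8, 0xFB, 5);
        t[0xF8].lo = 0x88;             /* below 0x200000 is overlong    */
        set_range(t, 0xFC, 0xFD, 6);
        t[0xFC].lo = 0x84;             /* below 0x4000000 is overlong   */
    }
}

/* The decoding loop itself never changes; only the table does. */
static int is_valid(const u8_entry t[256], const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        const u8_entry *e = &t[s[i]];
        if (e->len == 0 || e->len > len - i) return 0;
        if (e->len > 1 && (s[i + 1] < e->lo || s[i + 1] > e->hi)) return 0;
        for (size_t k = 2; k < e->len; k++)
            if ((s[i + k] & 0xC0) != 0x80) return 0;
        i += e->len;
    }
    return 1;
}
```

With flags of 0 this sketch rejects surrogates and anything above U+10FFFF;
which of those should be the default is exactly what is being debated here.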

Having looked up non-character code points, to confirm, you mean:

    There are sixty-six noncharacters: U+FDD0..U+FDEF and any code
    point ending in the value FFFE or FFFF (i.e. U+FFFE, U+FFFF,
    U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF). The set of
    noncharacters is stable, and no new noncharacters will ever be
    defined.

If so, I think that (conditionally) spotting

a: U+FDD0..U+FDEF is easy, as they all start with 0xEF in UTF-8
b: U+FFFE, U+FFFF is easy, as again they both start with 0xEF
c: U+1FFFE, U+1FFFF onwards is easy enough, as they are all in the 4 byte

but probably spotting a+b+c has to be enabled/disabled as one choice.
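
For reference, a short C sketch of spotting the 66 non-characters, both on a
decoded code point and directly on a well-formed UTF-8 sequence. The function
names are hypothetical and this is only one way to write the test, not the
attached code. The three-byte cases are U+FDD0..U+FDEF (0xEF 0xB7 0x90..0xAF)
and U+FFFE, U+FFFF (0xEF 0xBF 0xBE..0xBF). The four-byte cases all have the
low 16 bits set to FFFE or FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Code point test: true for the 66 Unicode non-characters. */
static int is_nonchar_cp(uint32_t cp)
{
    if (cp > 0x10FFFF) return 0;
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}

/* Byte-level test; assumes s[0..n-1] is one well-formed sequence. */
static int is_nonchar_utf8(const unsigned char *s, size_t n)
{
    if (n == 3 && s[0] == 0xEF) {
        if (s[1] == 0xB7) return s[2] >= 0x90 && s[2] <= 0xAF;
        if (s[1] == 0xBF) return s[2] >= 0xBE;
        return 0;
    }
    if (n == 4 && s[0] >= 0xF0 && s[0] <= 0xF4)
        return (s[1] & 0x0F) == 0x0F && s[2] == 0xBF && s[3] >= 0xBE;
    return 0;
}

/* Plain encoder for code points up to 0x10FFFF, used to cross-check. */
static size_t encode_utf8(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Because the byte-level test for all three groups shares the same few
comparisons, treating a+b+c as a single on/off choice looks natural.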

On Mon, Jan 03, 2011 at 01:25:59PM +0800, Jesse Vincent wrote:

> On Wed 15.Dec'10 at 20:39:34 +0000, Nicholas Clark wrote:

> > I think I can see how to do this, with probably little performance impact.
> I know this got a lot more discussion after this point. In general,
> Nick's proposal sounds rational to me. What's the current state of
> things?

I had wanted to work on this whilst on holiday* in Vienna, but

a: nasty persistent stomach bug**
b: laptop power supply failure
c: unexpected level of social interruption

has rather scuppered it.

I had written code about a month ago as a prototype (attached). It caused me
to find and also fix this:

commit a18d6e6e4cf998a0ba9067ceac2d75f71aedef15
Author: Nicholas Clark <>
Date:   Tue Dec 21 16:55:38 2010 +0000

    Fix IS_UTF8_CHAR() to recognise start bytes 0xF5, 0xF6, 0xF7.

    The refactoring of 3b0fc154d4e77cfb inadvertently introduced a bug
    in Perl_is_utf8_char() and its callers, such as Perl_is_utf8_string(),
    whereby the beyond-Unicode characters 0x140000 to 0x1fffff were no longer
    recognised as valid.
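
The range quoted in that commit message follows from the bit layout of a
4 byte sequence: the start byte 11110aaa carries the top three bits and the
three continuation bytes carry 18 more. A tiny C sketch of the arithmetic,
with hypothetical helper names and not taken from the commit itself:

```c
#include <stdint.h>

/* Lowest code point a 4 byte sequence with this start byte (0xF0..0xF7)
 * can encode, ignoring the overlong restriction that applies to 0xF0. */
static uint32_t four_byte_min(unsigned char start)
{
    return (uint32_t)(start & 0x07) << 18;
}

/* Highest code point for the same start byte: all 18 low bits set. */
static uint32_t four_byte_max(unsigned char start)
{
    return four_byte_min(start) | 0x3FFFF;
}
```

So 0xF4 tops out at 0x13FFFF, 0xF5 starts at 0x140000, and 0xF7 ends at
0x1FFFFF.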

After quite a bit of faffing with callgrind, I think I verified that the
code is at least as good as the current "fast" code. (But that's on my work
desktop, which is 1000 miles away and turned off, and I think it's a variant
of the attached code, with 2 byte UTF-8 handled outside the switch statement).

Still to do, I think, not ordered:

0: Eliminate EBCDIC support.
   [I can't see how to get any of this "right" on EBCDIC, blind.
    We're not getting any input, help, *anything* from anyone using Perl on
    EBCDIC. We have no access to test there, let alone anyone actually
    contributing. Contrast with (admittedly ASCII platforms) VOS, Haiku, etc,
    where we have active feedback for when we get things wrong.
    I believe that "we" can put EBCDIC *back*, if someone wants to pay for
    it, for a value of "pay" that is mutually agreeable up front, so not just
    "lobbing patches over the fence", and that really that is the only viable

1: Resolve bug 79960

2: Produce some suitable benchmarks for the current behaviour for $/ of
   a: undef
   b: "\n"
   c: ""
   d: "\n\n"
   e: \1024 # for which values of 1024?
   f: something Unicode
   for "typical" input
   and test for both values of PerlIO_fast_gets(fp) [see sv_gets()]

3: Refactor PerlIO so that a read larger than the buffer size bypasses the
   buffer [more specifically, quantized by buffer sizes first]

4: Evaluate whether the "this screams louder" slurping block is still needed
   (ie whether modern stdio on "various" platforms is good enough not to need
   to try to cheat, or whether cheating is even counterproductive)

5: Write the tests

6: Split apart all the code into "is UTF-8" vs "is binmode"

7: Convert the 4 to 6 places (ie maybe "this screams louder") on the UTF-8
   halves to use the validation code whilst copying.

Nicholas Clark

*  for a holiday defined as "where one does what one wants to do, rather than
   what anyone else wants one to do"
** I must have been ill. I didn't want to eat or drink.
