develooper Front page | perl.perl5.porters | Postings from February 2012

Re: Broken API for is_utf8_char()

From: Karl Williamson
Date: February 3, 2012 13:20
Subject: Re: Broken API for is_utf8_char()
Message ID: 4F2C4F79.6010502@khwilliamson.com

On 11/21/2011 12:04 AM, Jarkko Hietaniemi wrote:
> While agreeing that the is_utf8_char() API is horribly insecure (well,
> as insecure as, say, strlen() when applied to unknown data) (and FWIW,
> I might be the perpetrator), I'd say that depending on its uses it
> might make more sense to split it into a few new APIs. Namely, if we
> do a lot
>
> is_utf8_char(p, MAX_CONSTANT)
>
> that is inherently no better than an unbounded scan, on untrusted data.
>
> If we do a lot
>
> is_utf8_char(p, n)
>
> where n is the correct number of UTF-8 bytes, hang on, how did we know
> it anyway? The only case I can think of where a generic <ptr, len>
> approach seems reasonable is
>
> is_utf8_char(p, end - p)
>
> in which case we often already know something more about this buffer
> ending in end. Like if the p is actually SvPVX(sv), in which case
> the full API with the raw pointer and length is daft.
>

I put this on the back burner for a while, but have thought about it 
some more.  What I now propose is to take Nicholas' advice: mark this 
function as deprecated in embed.fnc, and create a new function, say,

  is_utf8_char_len(p, bufsize)

The API may be daft, and that may explain why no one uses the earlier 
one.  is_utf8_string(s, len) works for a single character as well as a 
whole string, and is typically what people want.

But for the case where someone knows what the buffer size is, the 
alternative function is trivial to provide.
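
To illustrate how small such a bounded check can be, here is a minimal 
sketch.  It is not Perl's implementation: the name 
sketch_is_utf8_char_len and its return convention are hypothetical, and 
it assumes standard UTF-8 of at most 4 bytes rather than Perl's extended 
encoding.  It does not reject surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not the Perl core function.  Returns the length
 * in bytes (1..4) of the well-formed UTF-8 character starting at p, or
 * 0 if it is malformed or would extend past bufsize bytes. */
static size_t
sketch_is_utf8_char_len(const unsigned char *p, size_t bufsize)
{
    size_t len, i;
    unsigned long cp;

    assert(p != NULL);
    if (bufsize == 0)
        return 0;

    if (p[0] < 0x80)
        return 1;                       /* ASCII */
    else if (p[0] < 0xC2)
        return 0;                       /* continuation or overlong lead */
    else if (p[0] < 0xE0) { len = 2; cp = p[0] & 0x1F; }
    else if (p[0] < 0xF0) { len = 3; cp = p[0] & 0x0F; }
    else if (p[0] < 0xF5) { len = 4; cp = p[0] & 0x07; }
    else
        return 0;                       /* beyond 4-byte UTF-8 */

    if (len > bufsize)
        return 0;                       /* never read past the buffer */

    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;                   /* not a continuation byte */
        cp = (cp << 6) | (p[i] & 0x3F);
    }

    if (len == 3 && cp < 0x800)
        return 0;                       /* overlong 3-byte form */
    if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF))
        return 0;                       /* overlong or above U+10FFFF */

    return len;
}
```

The only difference from an unbounded check is the single comparison of 
the expected length against bufsize before any continuation byte is read.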

There is one call to is_utf8_char() in the core.  It is in the routine 
common to functions such as is_utf8_alpha().  Rather than change the 
APIs of all of these, I'm proposing that we say that they should be 
called only on already-validated data.  We could then delete the call, 
or change it to use the new function, as in

  is_utf8_char_len(p, UTF8SKIP(p))

to get all the validation that doesn't involve buffer overflow.
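
A rough sketch of what that call amounts to, with hypothetical names 
throughout (sketch_utf8_skip stands in for the start-byte length lookup, 
and neither function is the Perl core code; standard 4-byte UTF-8 is 
assumed).  Because the bound comes from the data itself, the bounds test 
can never fail, so only the structural checks remain:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a start-byte length lookup: the number of
 * bytes the first byte claims the character occupies, or 0 if the byte
 * cannot start a character. */
static size_t
sketch_utf8_skip(unsigned char first)
{
    if (first < 0x80)                  return 1;
    if (first >= 0xC2 && first < 0xE0) return 2;
    if (first >= 0xE0 && first < 0xF0) return 3;
    if (first >= 0xF0 && first < 0xF5) return 4;
    return 0;
}

/* Hypothetical bounded check: 1 if the character at p is structurally
 * well formed and fits in bufsize bytes, otherwise 0. */
static int
sketch_char_ok(const unsigned char *p, size_t bufsize)
{
    size_t len, i;

    assert(p != NULL);
    len = sketch_utf8_skip(p[0]);
    if (len == 0 || len > bufsize)
        return 0;
    for (i = 1; i < len; i++)
        if ((p[i] & 0xC0) != 0x80)
            return 0;                  /* not a continuation byte */
    return 1;
}

/* The pattern from the text: the bound is derived from the start byte,
 * so "len > bufsize" is never true here.  The caller must already know
 * that the claimed number of bytes is readable. */
static int
sketch_checked_char(const unsigned char *p)
{
    return sketch_char_ok(p, sketch_utf8_skip(p[0]));
}
```

This catches a bad start byte or a missing continuation byte, but it 
gives no protection if the buffer ends before the claimed length, which 
is why the callers would have to be restricted to already-validated data.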