develooper Front page | perl.perl5.porters | Postings from July 2022

Re: Pre-RFC: New C API for converting from UTF-8 to code point

Karl Williamson
July 1, 2022 21:32
On 6/29/22 17:56, wrote:
> Karl Williamson <> wrote:
> :On 6/28/22 20:27, wrote:
> :> Karl Williamson <> wrote:
> :> :  Then the loop innards would look like:
> :> :
> :> :     while (s < e) {
> :> :         code_point = next_uvchr(s, e, &retlen);
> :> :
> :> :         if (retlen < 0) {
> :>
> :> I'd recommend making this `if (retlen <= 0) {` or otherwise handling
> :> the retlen == 0 case, even if you only expect that when s >= e: the
> :> function should be capable of doing something reasonable on an empty
> :> string, which probably means not croaking.
> :>
> :> :             ... process error ...
> :> :             retlen = -retlen;
> :> :         }
> :
> :It is currently illegal to call the existing functions with zero length
> :input.  The functions don't now croak on zero length input, but they do
> :assert against it, which on DEBUGGING builds is pretty much the same thing.
> Ok, let's hope module writers habitually develop with a debug build.

I would be amenable to croaking if called improperly outside debugging 
builds.  I'm not open to the function ever returning 0 length.  It's 
better to croak than loop indefinitely.  I've been there, done that, 
with the current interface.
> And in another reply:
> :The main difference would be that the return code doesn't have to be
> :disambiguated.  Right now you can't tell whether the input was a NUL
> :character or an error, without testing each return.  cpan modules don't
> :tend to do this, leaving them vulnerable.  This API is so that you can
> :be naive and still be safe.
> How naive do you want? Anyone that fails to check for negative retlen
> can fail quite unsafely.

I'm not clear as to what you are thinking here.  By safe, I mean 
sanitized, returning a value that won't be interpreted wrongly due to 
invalid UTF-8.  This could be from a noisy line, an unintentional error, 
or a deliberate attack.  The function very deliberately is designed so 
that, if called in the prescribed manner, the caller doesn't have to 
concern itself with those things.  The input is not what it should be, 
but it has been sanitized to not be misinterpretable.  Unlike parsing a 
string to calculate a number, the REPLACEMENT CHARACTER is a value 
reserved by Unicode to have just this particular meaning.  It has no 
other use.

Most code wouldn't know what to do in the event of an error. 
Serializers, for example, just pass through the data.  The result won't 
be the original, but it will be sanitized.  There's generally no real 
added value to them knowing there was an error, and having to deal with 
it.  Discarding an invalid value has been shown to make one vulnerable 
to attack.  The only real option is then to use the REPLACEMENT 
CHARACTER, which is what this interface does with no muss or fuss on the 
part of the caller.
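
A minimal sketch of the "naive but safe" loop described above.  The name 
sketch_next_cp is hypothetical and is not the proposed perl function; it 
is a simplified strict decoder (it rejects surrogates and overlongs, and 
consumes only the bytes examined before the fault, which may differ from 
what perl would do).  It follows the original proposal's convention: on 
malformed input it returns the REPLACEMENT CHARACTER and a negated, 
never-zero length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REPLACEMENT 0xFFFDu

/* Hypothetical stand-in, NOT the real perl API.  Returns the code point at
 * s.  On malformed input returns U+FFFD and sets *retlen negative; its
 * absolute value is always >= 1, so callers always advance. */
static uint32_t
sketch_next_cp(const unsigned char *s, const unsigned char *e,
               ptrdiff_t *retlen)
{
    assert(s < e);                  /* zero-length input is a caller bug */
    size_t avail = (size_t)(e - s);
    size_t need;                    /* continuation bytes expected */
    uint32_t cp, lowest;            /* lowest: smallest non-overlong value */
    unsigned char c = s[0];

    if (c < 0x80) { *retlen = 1; return c; }
    else if (c >= 0xC2 && c <= 0xDF) { need = 1; cp = c & 0x1F; lowest = 0x80; }
    else if (c >= 0xE0 && c <= 0xEF) { need = 2; cp = c & 0x0F; lowest = 0x800; }
    else if (c >= 0xF0 && c <= 0xF4) { need = 3; cp = c & 0x07; lowest = 0x10000; }
    else { *retlen = -1; return SKETCH_REPLACEMENT; }

    for (size_t i = 1; i <= need; i++) {
        if (i >= avail || (s[i] & 0xC0) != 0x80) {
            *retlen = -(ptrdiff_t)i;    /* bytes examined before the fault */
            return SKETCH_REPLACEMENT;
        }
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        *retlen = -(ptrdiff_t)(need + 1);
        return SKETCH_REPLACEMENT;
    }
    *retlen = (ptrdiff_t)(need + 1);
    return cp;
}

/* Walks the buffer; returns the number of malformations seen and stores
 * the number of code points (including replacements) in *n_cp. */
static size_t
sketch_count_errors(const unsigned char *s, const unsigned char *e,
                    size_t *n_cp)
{
    size_t errors = 0;
    *n_cp = 0;
    while (s < e) {
        ptrdiff_t retlen;
        (void) sketch_next_cp(s, e, &retlen);
        if (retlen < 0) {           /* optional: a naive caller may skip the */
            errors++;               /* count but must still negate retlen    */
            retlen = -retlen;
        }
        (*n_cp)++;
        s += retlen;
    }
    return errors;
}
```

Because the magnitude of the length is never zero, the loop always makes 
progress, even on input that is malformed throughout.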

Code that reads from a buffer that is being filled by another process 
does however want to look for the specific error of the final character 
so far being incomplete.  Such code would call the _msgs form of the 
function and if it is that specific error, do something like sleep and 
try again later.  For all other errors, it would just accept the 
REPLACEMENT CHARACTER.
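
As a rough illustration of the "incomplete final character" case, here is 
a sketch of how a reader of a partially filled buffer might tell that 
error apart from the others.  Both function names are hypothetical; the 
proposed _msgs form would presumably report this itself, and its actual 
interface is not shown in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Total sequence length implied by a lead byte, or 0 if the byte cannot
 * start a well-formed sequence.  Hypothetical helper. */
static size_t
sketch_expected_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if (c >= 0xC2 && c <= 0xDF) return 2;
    if (c >= 0xE0 && c <= 0xEF) return 3;
    if (c >= 0xF0 && c <= 0xF4) return 4;
    return 0;
}

/* True if [s, e) looks like the start of a character that was cut off by
 * the end of the buffer: a plausible lead byte followed only by
 * continuation bytes, with fewer bytes available than the lead implies.
 * A caller could then wait for more data instead of substituting. */
static bool
sketch_is_incomplete_tail(const unsigned char *s, const unsigned char *e)
{
    assert(s <= e);
    if (s == e) return false;

    size_t avail = (size_t)(e - s);
    size_t want = sketch_expected_len(s[0]);
    if (want == 0 || avail >= want) return false;

    for (size_t i = 1; i < avail; i++) {
        if ((s[i] & 0xC0) != 0x80) return false;  /* malformed, not cut off */
    }
    return true;
}
```

This check is deliberately loose: it does not validate second-byte ranges, 
so a few prefixes that can never complete legally would still be reported 
as incomplete.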

Or higher level code could ask for a retransmission.  The interface 
allows for code to get the necessary information.  But most code really 
can't do anything.  Take pattern matching.  If the pattern is looking 
for a particular character, and it gets a malformed one, it doesn't know 
what the intended one was, nor does it have a way to ask for the value 
again.  If the pattern was to match '.' at this point, it will match, 
because whatever the intended character was would match.  The code 
in the regex matching engine needn't be made more complicated to 
deal with something that it can't handle any better.

> I think we've established in previous discussion that you are more
> comfortable with this sort of overloaded interface than I am, so I
> do not wish to belabour the point. But this does rather remind me of
>, where pretty much every
> use of grok_atou in core was buggy. I do recommend reading through
> that discussion again to see if there's stuff worth learning that
> could be applicable to this API.

As a counter example, in that discussion you refer approvingly to the 
grok_bslash_x API.  IIRC, it was me who came up with that interface, 
precisely to solve the problem of getting the wrong number silently, or 
even with a warning, but still wrongly acted upon.

But that is a different problem.  The REPLACEMENT CHARACTER is a 
sentinel that blares out that this string is problematic.

From reading that discussion, I think you would be more comfortable 
with something like this:

     while (s < e) {
         if (! next_uvchr(s, e, &ret_cp, &ret_len)) {
             ... process error ...
         }
         else {
             ... process code_point ...
         }

         s += ret_len;
     }

or if the caller didn't care to handle errors

     while (s < e) {
         (void) next_uvchr(s, e, &ret_cp, &ret_len);

         ... process code_point ...

         s += ret_len;
     }

I would consider this.  I designed the original proposal to be more of a 
drop-in replacement for the current one.  But this one gets rid of the 
flag to say you want errors, and might be close enough to drop-in.  I 
await more people weighing in.
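
For concreteness, a sketch of what the boolean-returning form could look 
like.  The names sketch_next_cp_ok and sketch_decode_all are hypothetical, 
and the decoder is a simplified strict one rather than perl's.  The point 
is only the shape of the interface: success is the return value, while the 
code point and a never-zero length come back through separate pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REPLACEMENT 0xFFFDu

/* Hypothetical stand-in for the four-argument form, NOT the real perl API.
 * Returns true on well-formed input.  On malformed input returns false,
 * stores U+FFFD in *ret_cp, and still sets *ret_len to at least 1. */
static bool
sketch_next_cp_ok(const unsigned char *s, const unsigned char *e,
                  uint32_t *ret_cp, size_t *ret_len)
{
    assert(s < e);                  /* zero-length input is a caller bug */
    size_t avail = (size_t)(e - s);
    size_t need;                    /* continuation bytes expected */
    uint32_t cp, lowest;            /* lowest: smallest non-overlong value */
    unsigned char c = s[0];

    *ret_cp = SKETCH_REPLACEMENT;   /* pessimistic defaults */
    *ret_len = 1;

    if (c < 0x80) { *ret_cp = c; return true; }
    else if (c >= 0xC2 && c <= 0xDF) { need = 1; cp = c & 0x1F; lowest = 0x80; }
    else if (c >= 0xE0 && c <= 0xEF) { need = 2; cp = c & 0x0F; lowest = 0x800; }
    else if (c >= 0xF0 && c <= 0xF4) { need = 3; cp = c & 0x07; lowest = 0x10000; }
    else return false;

    for (size_t i = 1; i <= need; i++) {
        if (i >= avail || (s[i] & 0xC0) != 0x80) {
            *ret_len = i;           /* bytes examined before the fault */
            return false;
        }
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    *ret_len = need + 1;
    if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;
    *ret_cp = cp;
    return true;
}

/* A caller that does not care about errors: it copies up to cap code
 * points into out and returns how many it stored. */
static size_t
sketch_decode_all(const unsigned char *s, const unsigned char *e,
                  uint32_t *out, size_t cap)
{
    size_t n = 0;
    while (s < e && n < cap) {
        uint32_t ret_cp;
        size_t ret_len;
        (void) sketch_next_cp_ok(s, e, &ret_cp, &ret_len);
        out[n++] = ret_cp;          /* U+FFFD wherever input was malformed */
        s += ret_len;
    }
    return n;
}
```

With this shape there is no sign to undo before advancing, so a caller 
that ignores the return value cannot step backwards by mistake.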

> Hugo
