Re: Pre-RFC: New C API for converting from UTF-8 to code point
From: Karl Williamson
Date: July 1, 2022 21:32
Subject: Re: Pre-RFC: New C API for converting from UTF-8 to code point
Message ID: cf3ae547-f17d-0d02-8727-b036b6bacd40@khwilliamson.com
On 6/29/22 17:56, hv@crypt.org wrote:
> Karl Williamson <public@khwilliamson.com> wrote:
> :On 6/28/22 20:27, hv@crypt.org wrote:
> :> Karl Williamson <public@khwilliamson.com> wrote:
> :> : Then the loop innards would look like:
> :> :
> :> : while (s < e) {
> :> : code_point = next_uvchr(s, e, &retlen);
> :> :
> :> : if (retlen < 0) {
> :>
> :> I'd recommend making this `if (retlen <= 0) {` or otherwise handling
> :> the retlen == 0 case, even if you only expect that when s >= e: the
> :> function should be capable of doing something reasonable on an empty
> :> string, which probably means not croaking.
> :>
> :> : ... process error ...
> :> : retlen = -retlen;
> :> : }
> :
> :It is currently illegal to call the existing functions with zero length
> :input. The functions don't now croak on zero length input, but they do
> :assert against it, which on DEBUGGING builds is pretty much the same thing.
>
> Ok, let's hope module writers habitually develop with a debug build.
I would be amenable to croaking if called improperly outside debugging
builds. I'm not open to the function ever returning 0 length. It's
better to croak than loop indefinitely. I've been there, done that,
with the current interface.
>
> And in another reply:
> :The main difference would be that the return code doesn't have to be
> :disambiguated. Right now you can't tell whether the input was a NUL
> :character or an error, without testing each return. cpan modules don't
> :tend to do this, leaving them vulnerable. This API is so that you can
> :be naive and still be safe.
>
> How naive do you want? Anyone that fails to check for negative retlen
> can fail quite unsafely.
I'm not clear as to what you are thinking here. By safe, I mean
sanitized, returning a value that won't be interpreted wrongly due to
invalid UTF-8. This could be from a noisy line, an unintentional error,
or a deliberate attack. The function is very deliberately designed so
that, if called in the prescribed manner, the caller doesn't have to
concern itself with those things. The input is not what it should be,
but it has been sanitized to not be misinterpretable. Unlike parsing a
string to calculate a number, the REPLACEMENT CHARACTER is a value
reserved by Unicode to have just this particular meaning. It has no
other use.
Most code wouldn't know what to do in the event of an error.
Serializers, for example, just pass through the data. It won't be
the original, but it will be sanitized. There's generally no real
added value to them knowing there was an error, and having to deal with
it. Discarding an invalid value has been shown to make one vulnerable
to attack. The only real option is then to use the REPLACEMENT
CHARACTER, which is what this interface does with no muss or fuss on the
part of the caller.
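To make the sanitizing behaviour concrete, here is a minimal sketch of a decoder in the shape of the original proposal: malformed input yields the REPLACEMENT CHARACTER and a negated length. Everything in it is an assumption made for illustration. `sketch_next_cp` and `sketch_count_replacements` are hypothetical names, not the proposed perl API, and the policy of skipping exactly one byte per malformation is a simplification rather than what core would necessarily do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REPLACEMENT 0xFFFDu   /* U+FFFD REPLACEMENT CHARACTER */

/* Hypothetical stand-in for the proposed function, NOT the perl API.
 * Returns the next code point in [s, e).  On malformed input it returns
 * U+FFFD and sets *retlen to minus the number of bytes to skip (always
 * -1 in this sketch).  *retlen is never 0. */
static uint32_t
sketch_next_cp(const unsigned char *s, const unsigned char *e,
               ptrdiff_t *retlen)
{
    uint32_t cp, min;
    ptrdiff_t need, i;

    assert(s < e);   /* zero-length input is a caller error */

    if (s[0] < 0x80) {               /* ASCII */
        *retlen = 1;
        return s[0];
    }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; cp = s[0] & 0x07; min = 0x10000; }
    else goto malformed;             /* stray continuation or invalid lead */

    if (e - s < need)
        goto malformed;              /* sequence runs past the end */

    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            goto malformed;          /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Reject overlongs, surrogates, and values above Unicode's maximum. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        goto malformed;

    *retlen = need;
    return cp;

  malformed:
    *retlen = -1;
    return SKETCH_REPLACEMENT;
}

/* A caller that never looks at what went wrong: it only uses the
 * magnitude of retlen to advance.  Note that a genuine U+FFFD in the
 * input is counted as well. */
static size_t
sketch_count_replacements(const unsigned char *s, const unsigned char *e)
{
    size_t n = 0;

    while (s < e) {
        ptrdiff_t retlen;
        uint32_t cp = sketch_next_cp(s, e, &retlen);

        if (cp == SKETCH_REPLACEMENT)
            n++;
        s += (retlen < 0) ? -retlen : retlen;
    }
    return n;
}
```

The caller here still has to take the absolute value of the length to advance safely, but it never has to decide what to substitute for the bad bytes.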
Code that reads from a buffer that is being filled by another process
does however want to look for the specific error of the final character
so far being incomplete. Such code would call the _msgs form of the
function and if it is that specific error, do something like sleep and
try again later. For all other errors, it would just accept the
proffered REPLACEMENT CHARACTER.
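As a rough illustration of the "final character so far is incomplete" case, the predicate below checks whether the bytes at the tail of a buffer look like the start of a multi-byte sequence that simply has not finished arriving. It is a sketch under stated assumptions: `sketch_is_incomplete` is a hypothetical helper, not the _msgs interface, and it does not try to detect prefixes that could only ever lead to an overlong or otherwise invalid sequence.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of any perl API.
 * Returns true if [s, e) starts with a multi-byte lead byte, every byte
 * present after it is a continuation byte, and the sequence needs more
 * bytes than are available.  A reader could then wait for more input
 * instead of accepting a replacement character. */
static bool
sketch_is_incomplete(const unsigned char *s, const unsigned char *e)
{
    ptrdiff_t need, i;

    if (s >= e)
        return false;               /* nothing buffered at all */

    if ((s[0] & 0xE0) == 0xC0)      need = 2;
    else if ((s[0] & 0xF0) == 0xE0) need = 3;
    else if ((s[0] & 0xF8) == 0xF0) need = 4;
    else
        return false;               /* ASCII, stray continuation, bad lead */

    if (e - s >= need)
        return false;               /* all bytes present: not this error */

    for (i = 1; i < e - s; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return false;           /* broken sequence, not a short one */
    }
    return true;
}
```

A reading loop would call something like this only after the decoder reports a malformation at the end of the buffer, and fall back to the REPLACEMENT CHARACTER for every other kind of error.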
Or higher level code could ask for a retransmission. The interface
allows for code to get the necessary information. But most code really
can't do anything. Take pattern matching. If the pattern is looking
for a particular character, and it gets a malformed one, it doesn't know
what the intended one was, nor does it have a way to ask for the value
again. If the pattern was to match '.' at this point, it will match,
because whatever the intended character was would match. The code
in the regex matching engine needn't be made more complicated to
deal with something that it can't do any better.
>
> I think we've established in previous discussion that you are more
> comfortable with this sort of overloaded interface than I am, so I
> do not wish to belabour the point. But this does rather remind me of
> https://github.com/Perl/perl5/issues/14498, where pretty much every
> use of grok_atou in core was buggy. I do recommend reading through
> that discussion again to see if there's stuff worth learning that
> could be applicable to this API.
As a counter example, in that discussion you refer approvingly to the
grok_bslash_x API. IIRC, it was me who came up with that interface,
precisely to solve the problem of getting the wrong number silently, or
even with a warning, but still wrongly acted upon.
But that is a different problem. The REPLACEMENT CHARACTER is a
sentinel that blares out that this string is problematic.
From reading that discussion, I think you would be more comfortable
with something like this:
    while (s < e) {
        if (! next_uvchr(s, e, &ret_cp, &ret_len)) {
            ... process error ...
        }
        else {
            ... process code_point ...
        }
        s += ret_len;
    }
or, if the caller didn't care to handle errors:

    while (s < e) {
        (void) next_uvchr(s, e, &ret_cp, &ret_len);
        ... process code_point ...
        s += ret_len;
    }
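For readers who want to see how the boolean-returning shape could behave, here is a minimal sketch. It is not an implementation of the proposal: `sketch_next_cp_bool` is a hypothetical name, the parameter types are guesses, and skipping one byte per malformation is a simplification. The point it illustrates is that the outputs are always usable, with the returned length never 0 and the code point set to the REPLACEMENT CHARACTER on failure, so ignoring the return value still advances safely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the boolean-returning variant, NOT the perl API.
 * Returns true on well-formed input.  On failure, *ret_cp is U+FFFD and
 * *ret_len is 1, so the caller's "s += ret_len" always makes progress. */
static bool
sketch_next_cp_bool(const unsigned char *s, const unsigned char *e,
                    uint32_t *ret_cp, size_t *ret_len)
{
    /* Smallest code point that legitimately needs a given byte count. */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t need, i;
    uint32_t cp;

    assert(s < e);          /* zero-length input is a caller error */

    /* Pessimistic defaults, overwritten only on success. */
    *ret_cp = 0xFFFD;
    *ret_len = 1;

    if (s[0] < 0x80) {      /* ASCII */
        *ret_cp = s[0];
        return true;
    }
    if ((s[0] & 0xE0) == 0xC0)      { need = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; cp = s[0] & 0x07; }
    else
        return false;       /* stray continuation or invalid lead byte */

    if ((size_t)(e - s) < need)
        return false;       /* sequence runs past the end */

    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return false;   /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Reject overlongs, surrogates, and values above Unicode's maximum. */
    if (cp < min_for_len[need] || cp > 0x10FFFF
        || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;

    *ret_cp = cp;
    *ret_len = need;
    return true;
}
```

With this shape, the second loop form above works unchanged: a caller that casts the result to void still gets either the real code point or U+FFFD, and a length of at least 1.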
I would consider this. I designed the original proposal to be more of a
drop-in replacement for the current one. But this one gets rid of the
flag to say you want errors, and might be close enough to drop-in. I'll
wait for more people to weigh in.
>
> Hugo