
Re: Pre-RFC: New C API for converting from UTF-8 to code point

From: Karl Williamson
Date: June 29, 2022 22:39
Subject: Re: Pre-RFC: New C API for converting from UTF-8 to code point
Message ID: be52f4d5-5c12-7e32-d528-8d71bf2ddbd5@khwilliamson.com

On 6/28/22 19:55, Tony Cook wrote:
> On Tue, Jun 28, 2022 at 09:17:50AM -0600, Karl Williamson wrote:
>> In response to GH #19897 and GH #19842, I think we need to come up with a
>> better API to replace the deprecated functions.
>>
>> One of the issues with the existing API is that the behavior changes
>> depending on whether warnings are enabled or not; something usually outside
>> the purview of a module author.  There's also the problem in some cases of
>> having to disambiguate the return being successful or not.
> 
> Are these intended to replace just the deprecated functions, or also
> as an easier to use version of utf8n_to_uvchr() and its variants?

I would want to replace core calls with these, which would simplify the 
logic in most places.
> 
> It might be worth describing how the new APIs differ from the existing
> non-deprecated APIs.  The obvious difference is start/end vs
> start/length, but I think error reporting is handled differently too.

I can add such a description.  Some existing functions already take
start/end pointers rather than start/length parameters.  I have found
through unhappy experience that it is quite possible for a naive function
to call another with a length that should be negative, but which
ends up being treated as an extremely large positive number.  Using
SSize_t avoids that, but halves the maximum permissible length.
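
The length pitfall described above can be sketched in plain C, without any Perl headers. All function names here are hypothetical and exist only for illustration; size_t and ptrdiff_t stand in for Perl's unsigned and signed size types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: what an unsigned length parameter receives when the caller
 * computes (e - s) and e has ended up before s. */
static size_t len_as_unsigned(const unsigned char *s, const unsigned char *e)
{
    ptrdiff_t diff = e - s;   /* negative when e is before s */
    return (size_t) diff;     /* conversion wraps to a huge positive value */
}

/* With start/end pointers the callee can see the problem directly. */
static int range_is_valid(const unsigned char *s, const unsigned char *e)
{
    return s <= e;
}

/* Hypothetical checked conversion: refuse to produce a length at all
 * unless the range is sane. */
static size_t checked_len(const unsigned char *s, const unsigned char *e)
{
    assert(range_is_valid(s, e));
    return (size_t) (e - s);
}
```

A callee that only receives the wrapped length has no way to tell it from a legitimately long buffer, whereas one that receives both pointers can compare them before touching memory.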

And some of the current functions that do take a Size_t parameter
promptly add it to the start so they can do 'while (s < e)', which leads
to more elegant loops.  So it is generally slightly more efficient to
use the start/end form.  If you need to keep the original starting
position, saving it in an s0 variable works well.
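
As a rough illustration of the 'while (s < e)' loop shape and the s0 idiom, here is a small standalone C sketch. The function names are made up for this example and are not part of any existing or proposed Perl API; the loop merely counts bytes that are not UTF-8 continuation bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical start/end function: count the bytes in [s, e) that are not
 * UTF-8 continuation bytes (10xxxxxx), and report how many bytes were
 * scanned. */
static size_t count_non_continuation(const unsigned char *s,
                                     const unsigned char *e,
                                     size_t *consumed)
{
    const unsigned char *s0 = s;   /* keep the original starting position */
    size_t count = 0;

    assert(s <= e);
    while (s < e) {
        if ((*s & 0xC0) != 0x80)
            count++;
        s++;
    }
    if (consumed)
        *consumed = (size_t) (s - s0);
    return count;
}

/* Hypothetical start/length wrapper: it converts to an end pointer
 * immediately and then uses the same loop. */
static size_t count_non_continuation_len(const unsigned char *s, size_t len,
                                         size_t *consumed)
{
    return count_non_continuation(s, s + len, consumed);
}
```

The length-taking wrapper does nothing except compute the end pointer, which is the extra step the start/end form lets a caller skip when it already has that pointer.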

The main difference would be that the return value doesn't have to be
disambiguated.  Right now you can't tell whether the input was a NUL
character or an error without testing each return.  CPAN modules tend
not to do this, leaving them vulnerable.  The point of this API is that
you can be naive and still be safe.
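
To make the "naive and still safe" idea concrete, here is a simplified standalone sketch of a decoder with that behaviour. It is not the proposed function: decode_one, decode_all, and the resynchronisation rule (skip the lead byte plus any continuation bytes that validly followed it) are assumptions made for this example, and it handles only standard UTF-8 up to U+10FFFF rather than Perl's extended form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHARACTER 0xFFFDu

/* Hypothetical decoder: returns the code point at s, or U+FFFD if the
 * sequence is malformed.  *advance is always set to at least 1, so a
 * caller that never checks for errors still makes progress.  A return
 * of 0 can only mean the input really was a NUL. */
static uint32_t decode_one(const unsigned char *s, const unsigned char *e,
                           size_t *advance)
{
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t cp;
    size_t need, i;

    assert(s < e);
    if (*s < 0x80)                { *advance = 1; return *s; }
    else if ((*s & 0xE0) == 0xC0) { need = 2; cp = *s & 0x1F; }
    else if ((*s & 0xF0) == 0xE0) { need = 3; cp = *s & 0x0F; }
    else if ((*s & 0xF8) == 0xF0) { need = 4; cp = *s & 0x07; }
    else                          { *advance = 1; return REPLACEMENT_CHARACTER; }

    for (i = 1; i < need; i++) {
        if (s + i >= e || (s[i] & 0xC0) != 0x80) {
            *advance = i;         /* resume at the first unexpected byte */
            return REPLACEMENT_CHARACTER;
        }
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *advance = need;
    /* Reject overlongs, surrogates, and values above U+10FFFF. */
    if (cp < min_for_len[need] || cp > 0x10FFFF
        || (cp >= 0xD800 && cp <= 0xDFFF))
        return REPLACEMENT_CHARACTER;
    return cp;
}

/* A deliberately naive caller: no error checks at all. */
static size_t decode_all(const unsigned char *s, const unsigned char *e,
                         uint32_t *out, size_t out_max)
{
    size_t n = 0;
    while (s < e && n < out_max) {
        size_t advance;
        out[n++] = decode_one(s, e, &advance);
        s += advance;
    }
    return n;
}
```

In this sketch the caller never has to ask whether a zero return was a NUL or a failure, and malformed input cannot stall the loop or run it past the end of the buffer.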

The other major difference is that the state of 'use warnings' wouldn't
affect what the function returns; the fact that it currently does is
quite problematic.


> 
>> The program would not have to concern itself with malformed input; the
>> function would take care of that by itself, returning REPLACEMENT CHARACTER
>> for each malformed sequence, and setting retlen to be the offset of the
>> starting position of the next potentially legal character.  If utf8 warnings
>> are on, those would be raised for each iteration that found a malformation.
> 
> Would there be a simple way to prevent this API producing warnings?

I had been considering adding a flag to override the state of 'use 
warnings' to turn them off unconditionally (but not the inverse).  But 
the API does already offer that capability without the flag, and you can 
get the inverse as well.  If you use the '_msgs' form, it doesn't raise 
any warnings, but returns a hash of the ones it otherwise would have. 
You can discard the hash, or translate it into the language of your 
choice, or output it yourself, regardless of 'use warnings'.  The hash 
also lets you know precisely what the malformation was.

There certainly could be a flag to just not raise any warnings.

> 
> Tony

