
Pre-RFC: New C API for converting from UTF-8 to code point

From: Karl Williamson
Date: June 28, 2022 15:18
Subject: Pre-RFC: New C API for converting from UTF-8 to code point
Message ID: 06897ebc-aa9b-556e-8b0e-ff50d3f903d6@khwilliamson.com

In response to GH #19897 and GH #19842, I think we need to come up with 
a better API to replace the deprecated functions.

One of the issues with the existing API is that its behavior changes 
depending on whether warnings are enabled, which is usually outside a 
module author's control.  In some cases the caller also has to 
disambiguate whether the return value indicates success or failure.

To that end, I'm proposing the following API.  For the current draft, 
I'm using the name 'next_uvchr'; suggestions welcome.

Most code would do the following to process input:
     const U8 *        s;
     const U8 *        e;
     UV                code_point;
     PERL_INT_FAST8_T  retlen;

     while (s < e) {
         code_point = next_uvchr(s, e, &retlen);

         ... process code_point ...

         s += retlen;
     }

This loop would safely walk the string of bytes s .. e-1, which is 
assumed to be encoded as Perl extended UTF-8.  Each call converts the 
next UTF-8 encoded character to its code point and stores into retlen 
the number of bytes that character occupies.

The program would not have to concern itself with malformed input; the 
function would take care of that by itself, returning REPLACEMENT 
CHARACTER for each malformed sequence, and setting retlen to the offset 
from s of the next position where a legal character could start.  If 
utf8 warnings are on, they would be raised for each iteration that 
found a malformation.
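
To make the intended semantics concrete, here is a minimal, self-contained sketch of a stand-in decoder and the loop above. Everything named sketch_* is hypothetical and is not the proposed implementation: it uses plain C types instead of U8/UV, handles only strict UTF-8 (not Perl extended UTF-8), raises no warnings, and its choice of how many bytes to skip after a malformation is just one plausible policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REPLACEMENT 0xFFFDu  /* U+FFFD REPLACEMENT CHARACTER */

/* Hypothetical stand-in for next_uvchr(): decode one character from
 * s .. e-1.  On malformed input return U+FFFD.  retlen may be NULL. */
static uint32_t
sketch_next_uvchr(const uint8_t *s, const uint8_t *e, int *retlen)
{
    uint32_t cp, min;
    int need;   /* number of continuation bytes expected */
    int i;

    assert(s < e);

    if (s[0] < 0x80) {
        if (retlen) *retlen = 1;
        return s[0];
    }
    else if (s[0] >= 0xC2 && s[0] <= 0xDF) { cp = s[0] & 0x1F; need = 1; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0)        { cp = s[0] & 0x0F; need = 2; min = 0x800; }
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) { cp = s[0] & 0x07; need = 3; min = 0x10000; }
    else {
        /* Stray continuation byte or a byte that can never start a char */
        if (retlen) *retlen = 1;
        return SKETCH_REPLACEMENT;
    }

    for (i = 1; i <= need; i++) {
        if (s + i >= e || (s[i] & 0xC0) != 0x80) {
            /* Truncated or interrupted: skip only the bytes seen so far */
            if (retlen) *retlen = i;
            return SKETCH_REPLACEMENT;
        }
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    if (retlen) *retlen = need + 1;

    /* Overlongs, surrogates, and values above U+10FFFF count as malformed */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return SKETCH_REPLACEMENT;
    return cp;
}

/* The loop from the post: returns the number of characters seen and
 * stores in *bad how many of them came back as U+FFFD. */
static size_t
sketch_count(const uint8_t *s, const uint8_t *e, size_t *bad)
{
    size_t n = 0;
    int retlen;

    *bad = 0;
    while (s < e) {
        uint32_t code_point = sketch_next_uvchr(s, e, &retlen);

        if (code_point == SKETCH_REPLACEMENT) (*bad)++;
        n++;
        s += retlen;    /* always at least 1, so the loop terminates */
    }
    return n;
}
```

Note that in this sketch a correctly encoded U+FFFD in the input is indistinguishable from a malformation, which is acceptable for the naive loop but not for callers that need to tell the two apart.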

If you don't want to consume the input string, just pass NULL instead of 
&retlen, or perhaps there could be

     peek_uvchr(s, e)

Those few programs that want more control could use

     next_uvchr(s, e, &retlen, flags)

'flags' would be any of the ones accepted by utf8n_to_uvchr(), as 
documented in perlapi, with the addition of

     UTF8_RETURN_NEGATIVE_LENGTH_ON_ERROR  (or some such name)

This would change the function to store into retlen the negated count 
of bytes it consumed, if and only if this character was malformed.  
Then the loop innards would look like:

     while (s < e) {
         code_point = next_uvchr(s, e, &retlen,
                                 UTF8_RETURN_NEGATIVE_LENGTH_ON_ERROR);

         if (retlen < 0) {
             ... process error ...
             retlen = -retlen;
         }
         else {
             ... process code_point ...
         }

         s += retlen;    // or abs(retlen)
     }

The reason for this design is that the typical naive handling stays 
safe, while the same function signature also works for more refined 
handling.
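
As a rough illustration of why the negative length is useful, the sketch below lets a caller distinguish a genuine U+FFFD in the input from one returned for a malformation. All sketch_* names and the flag value are hypothetical; the decoder is a simplified strict UTF-8 stand-in, not the proposed function, and it ignores the other utf8n_to_uvchr() flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NEG_LEN_ON_ERROR 0x1u  /* hypothetical flag value */

/* Length implied by a lead byte, or 0 if it cannot start a character. */
static int
sketch_expected_len(uint8_t b)
{
    if (b < 0x80)              return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

/* Hypothetical stand-in for the four-argument next_uvchr(). */
static uint32_t
sketch_next_uvchr_flags(const uint8_t *s, const uint8_t *e,
                        int *retlen, unsigned flags)
{
    static const uint8_t  lead_mask[5] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
    static const uint32_t min_cp[5]    = { 0, 0, 0x80, 0x800, 0x10000 };
    int want, ok, used = 1;
    uint32_t cp = 0;

    assert(s < e);
    want = sketch_expected_len(s[0]);
    ok = want > 0;
    if (ok) cp = (uint32_t)(s[0] & lead_mask[want]);

    /* Gather continuation bytes, stopping at the first bad or missing one */
    while (ok && used < want) {
        if (s + used >= e || (s[used] & 0xC0) != 0x80) ok = 0;
        else cp = (cp << 6) | (s[used++] & 0x3F);
    }

    /* Reject overlongs, surrogates, and values above U+10FFFF */
    if (ok && (cp < min_cp[want] || cp > 0x10FFFF
               || (cp >= 0xD800 && cp <= 0xDFFF)))
        ok = 0;

    if (!ok) cp = 0xFFFD;
    *retlen = (!ok && (flags & SKETCH_NEG_LEN_ON_ERROR)) ? -used : used;
    return cp;
}

/* The refined loop from the post: count only true malformations. */
static size_t
sketch_count_errors(const uint8_t *s, const uint8_t *e)
{
    size_t errors = 0;
    int retlen;

    while (s < e) {
        (void) sketch_next_uvchr_flags(s, e, &retlen,
                                       SKETCH_NEG_LEN_ON_ERROR);
        if (retlen < 0) {
            errors++;           /* ... process error ... */
            retlen = -retlen;
        }
        s += retlen;
    }
    return errors;
}
```

With the flag clear, the same function behaves like the naive version: the length is always positive and the caller can simply advance by it.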

If you wanted to have complete control of error handling, there would be

     AV *msgs;
     next_uvchr_msgs(s, e, &retlen, flags, &msgs)

corresponding to the existing utf8n_to_uvchr_msgs().

ppport.h would port all these back to 5.6.1.
