Front page | perl.perl5.porters |
Postings from June 2022
Pre-RFC: New C API for converting from UTF-8 to code point
From:
Karl Williamson
Date:
June 28, 2022 15:18
Subject:
Pre-RFC: New C API for converting from UTF-8 to code point
Message ID:
06897ebc-aa9b-556e-8b0e-ff50d3f903d6@khwilliamson.com
In response to GH #19897 and GH #19842, I think we need to come up with
a better API to replace the deprecated functions.
One of the issues with the existing API is that the behavior changes
depending on whether warnings are enabled or not; something usually
outside the purview of a module author. There's also the problem in
some cases of having to disambiguate the return being successful or not.
To that end, I'm proposing the following API. For the current draft,
I'm using the name 'next_uvchr'; suggestions are welcome.
Most code would do the following to process input:
    const U8 * s;
    const U8 * e;
    UV code_point;
    PERL_INT_FAST8_T retlen;

    while (s < e) {
        code_point = next_uvchr(s, e, &retlen);
        ... process code_point ...
        s += retlen;
    }
This loop would safely walk the string of bytes s .. e-1, which is
assumed to be encoded as Perl extended UTF-8. Each iteration converts
the next UTF-8 encoded character to its code point and stores into
retlen the number of bytes that character occupies.
The program would not have to concern itself with malformed input; the
function would take care of that by itself, returning REPLACEMENT
CHARACTER for each malformed sequence, and setting retlen to the
offset from s of the start of the next potentially legal character.
If utf8 warnings are on, those would be raised for each iteration that
found a malformation.
If you don't want to consume the input string, just pass NULL instead of
&retlen, or perhaps there could be
    peek_uvchr(s, e)
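To make the behaviour described above concrete, here is a minimal standalone sketch in plain C. It is not Perl's implementation and not the proposed function itself: the names sketch_next_uvchr, sketch_peek_uvchr and sketch_count_replacements are hypothetical, stdint types stand in for U8 and UV, and the decoder only understands standard UTF-8 sequences of 1 to 4 bytes (no Perl extended UTF-8, no warnings). How many bytes a malformed sequence consumes is also an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REPLACEMENT 0xFFFDu   /* U+FFFD REPLACEMENT CHARACTER */

/* Hypothetical stand-in for the proposed next_uvchr(); NOT Perl's code.
 * Decodes one character from s .. e-1. On malformed input it returns
 * U+FFFD and reports the bytes to skip. retlen may be NULL. */
static uint32_t sketch_next_uvchr(const uint8_t *s, const uint8_t *e,
                                  int8_t *retlen)
{
    uint32_t cp, min;   /* min: smallest code point legal for this length */
    int need, got = 0;  /* continuation bytes expected / actually read */

    assert(s < e);
    if (s[0] < 0x80)                { cp = s[0];        need = 0; min = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; need = 1; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; need = 2; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; need = 3; min = 0x10000; }
    else {              /* stray continuation byte or unsupported start byte */
        if (retlen) *retlen = 1;
        return SKETCH_REPLACEMENT;
    }

    /* Never read at or past e, and stop at the first non-continuation byte. */
    while (got < need && s + 1 + got < e && (s[1 + got] & 0xC0) == 0x80) {
        cp = (cp << 6) | (s[1 + got] & 0x3F);
        got++;
    }
    if (retlen) *retlen = (int8_t)(1 + got);

    /* Truncated or overlong sequences are treated as malformed. */
    return (got < need || cp < min) ? SKETCH_REPLACEMENT : cp;
}

/* Hypothetical peek: decode without consuming, by passing NULL for retlen. */
static uint32_t sketch_peek_uvchr(const uint8_t *s, const uint8_t *e)
{
    return sketch_next_uvchr(s, e, NULL);
}

/* The naive loop: counts how many U+FFFD results were seen. Note that a
 * genuine U+FFFD in the input is counted too; this form cannot tell them
 * apart. */
static size_t sketch_count_replacements(const uint8_t *s, size_t len)
{
    const uint8_t *e = s + len;
    size_t count = 0;

    while (s < e) {
        int8_t retlen;
        if (sketch_next_uvchr(s, e, &retlen) == SKETCH_REPLACEMENT)
            count++;
        s += retlen;
    }
    return count;
}
```

The caller's loop never inspects the bytes itself; all bounds checking and recovery lives in the decoder, which is the point of the proposed shape.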
Those few programs that want more control could use

    next_uvchr(s, e, &retlen, flags)

'flags' would be any of the ones accepted by utf8n_to_uvchr(), as
documented in perlapi, with the addition of

    UTF8_RETURN_NEGATIVE_LENGTH_ON_ERROR (or some such name)

This would change the function to set into retlen the negative value of
how many bytes it consumed, if and only if this character was malformed.
Then the loop innards would look like:
    while (s < e) {
        /* flags includes UTF8_RETURN_NEGATIVE_LENGTH_ON_ERROR */
        code_point = next_uvchr(s, e, &retlen, flags);
        if (retlen < 0) {
            ... process error ...
            retlen = -retlen;
        }
        else {
            ... process code_point ...
        }
        s += retlen; // or abs(retlen)
    }
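A standalone sketch of the negative-length convention follows, again in plain C rather than against Perl's headers. Everything named sketch_* or SKETCH_* is hypothetical, including the flag's value; the decoder handles only standard UTF-8 of 1 to 4 bytes and is an illustration of the idea, not the proposed implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag value; the post only suggests a name for it. */
#define SKETCH_NEGATIVE_LENGTH_ON_ERROR 0x1u

/* Hypothetical stand-in for next_uvchr(s, e, &retlen, flags); NOT Perl's
 * code. With the flag set, *retlen is negated if and only if the sequence
 * at s is malformed. */
static uint32_t sketch_next_uvchr_flags(const uint8_t *s, const uint8_t *e,
                                        int8_t *retlen, unsigned flags)
{
    /* Smallest code point allowed for 0..3 continuation bytes. */
    static const uint32_t min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
    uint32_t cp;
    int need, got = 0, bad;

    assert(s < e && retlen != NULL);
    if (s[0] < 0x80)                { cp = s[0];        need = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; need = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; need = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; need = 3; }
    else                            { cp = 0;           need = -1; }

    while (got < need && s + 1 + got < e && (s[1 + got] & 0xC0) == 0x80) {
        cp = (cp << 6) | (s[1 + got] & 0x3F);
        got++;
    }
    /* Bad start byte, truncated sequence, or overlong encoding. */
    bad = need < 0 || got < need || cp < min_cp[need];

    *retlen = (int8_t)(1 + got);
    if (!bad)
        return cp;
    if (flags & SKETCH_NEGATIVE_LENGTH_ON_ERROR)
        *retlen = (int8_t)-*retlen;
    return 0xFFFD;
}

/* The loop from the post: counts well-formed characters and the number of
 * bytes that belonged to malformed sequences. */
static void sketch_scan(const uint8_t *s, size_t len,
                        size_t *chars, size_t *bad_bytes)
{
    const uint8_t *e = s + len;

    *chars = *bad_bytes = 0;
    while (s < e) {
        int8_t retlen;
        (void)sketch_next_uvchr_flags(s, e, &retlen,
                                      SKETCH_NEGATIVE_LENGTH_ON_ERROR);
        if (retlen < 0) {            /* error path */
            retlen = (int8_t)-retlen;
            *bad_bytes += (size_t)retlen;
        }
        else {                       /* ordinary character */
            (*chars)++;
        }
        s += retlen;
    }
}
```

Because the sign of retlen carries the error, a well-formed U+FFFD in the input (bytes EF BF BD) is counted as an ordinary character, while a truncated sequence is not, even though both yield the same return value.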
The reason for this design is that the typical naive handling is safe by
default, and the same function signature also works for more refined
handling.
If you wanted to have complete control of error handling, there would be

    AV *msgs;
    next_uvchr_msgs(s, e, &retlen, flags, &msgs)

corresponding to the existing utf8n_to_uvchr_msgs().
ppport.h would port all these back to 5.6.1.