Re: SvPVutf8 validity

From: Karl Williamson
Date: March 22, 2021 15:53
Subject: Re: SvPVutf8 validity
Message ID: 802cf8f8-4e8d-9bef-452b-9e49c4548210@khwilliamson.com

On 3/22/21 5:23 AM, Tony Cook wrote:
> On Mon, Mar 22, 2021 at 06:54:00AM -0400, Felipe Gasper wrote:
>>
>>> On Mar 21, 2021, at 11:20 PM, Tony Cook <tony@develop-help.com> wrote:
>>>
>>> On Sun, Mar 21, 2021 at 11:02:25PM -0400, Felipe Gasper wrote:
>>>> Hello,
>>>>
>>>> Does SvPVutf8 have the same UTF-8 validity problems as Encode::encode_utf8()?
>>>
>>> It returns the internal UTF-8 encoding, which can include surrogates,
>>> etc.
>>>
>>> If that's not what concerns you, please be more specific.
>>
>> Yeah, that’s it: SvPVutf8 is Perl’s internal “lax” utf8 rather than official, valid UTF-8. So SvPVutf8 will happily encode code points that UTF-8 forbids, e.g., "\x{ffff}".
> 
> Unicode explicitly permits noncharacters for internal use and doesn't
> forbid them for interchange, from 23.7 Noncharacters:
> 
>    Applications are free to use any of these noncharacter code points
>    internally. They have no standard interpretation when exchanged
>    outside the context of internal use. However, they are not illegal in
>    interchange, nor does their presence cause Unicode text to be
>    ill-formed.
> 
> There is a "should not" for public exchange in 3.2 Conformance
> Requirements, but they aren't forbidden.
> 
>> It sounds, then, like XS modules that speak UTF-8 to external libraries should normally pass SvPVutf8’s output through is_strict_utf8_string() or a variant? So this would be a (documentation-worthy?) caveat of using SvPVutf8.
> 
> A bigger deal would be using "supers" or characters beyond U+10FFFF.
> 
> But I think this type of issue should be dealt with on input - don't
> allow these characters into your strings in the first place, and
> SvPVutf8() won't return their encoded forms.
> 
> Tony
> 


Note the existence (from perlapi) of:

is_c9strict_utf8_string
              Returns TRUE if the first "len" bytes of string "s" form a
              valid UTF-8-encoded string that conforms to Unicode
              Corrigendum #9
              <http://www.unicode.org/versions/corrigendum9.html>; otherwise
              it returns FALSE. If "len" is 0, it will be calculated using
              strlen(s) (which means if you use this option, that "s" can't
              have embedded "NUL" characters and has to have a terminating
              "NUL" byte). Note that all characters being ASCII constitute
              'a valid UTF-8 string'.

              This function returns FALSE for strings containing any code
              points above the Unicode max of 0x10FFFF or surrogate code
              points, but accepts non-character code points per
              Corrigendum #9
              <http://www.unicode.org/versions/corrigendum9.html>.
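
To make the rule concrete, here is a rough standalone sketch in plain C of the check described above: reject surrogates and anything above 0x10FFFF, but let noncharacters such as U+FFFF through. The name c9strict_utf8_ok is made up for illustration. This is not Perl's implementation, it assumes an ASCII platform, and unlike the perlapi function it does not treat a "len" of 0 as "call strlen". In real XS code you would presumably just fetch the buffer with SvPVutf8(sv, len) and hand it to is_c9strict_utf8_string() rather than roll your own.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the Perl API): returns true if the
 * first len bytes of s are well-formed UTF-8 under the Corrigendum #9
 * reading, i.e. no surrogates and nothing above U+10FFFF, while
 * noncharacters such as U+FFFF are accepted. */
bool c9strict_utf8_ok(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = s[i];
        uint32_t cp;    /* code point being decoded */
        uint32_t min;   /* smallest code point allowed for this length */
        size_t need;    /* number of continuation bytes expected */

        if (b < 0x80) {                 /* ASCII: always fine */
            i++;
            continue;
        } else if (b >= 0xC2 && b <= 0xDF) {
            cp = b & 0x1F; need = 1; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            cp = b & 0x0F; need = 2; min = 0x800;
        } else if (b >= 0xF0 && b <= 0xF4) {
            cp = b & 0x07; need = 3; min = 0x10000;
        } else {
            /* stray continuation byte, overlong C0/C1, or F5..FF */
            return false;
        }

        if (len - i - 1 < need)
            return false;               /* sequence is truncated */

        for (size_t k = 1; k <= need; k++) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80)
                return false;           /* not a continuation byte */
            cp = (cp << 6) | (uint32_t)(c & 0x3F);
        }

        if (cp < min)
            return false;               /* overlong encoding */
        if (cp > 0x10FFFF)
            return false;               /* above the Unicode maximum */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;               /* surrogate */

        i += need + 1;
    }
    return true;
}
```

With this sketch the three bytes EF BF BF (U+FFFF, a noncharacter) pass, while ED A0 80 (the surrogate U+D800) and F4 90 80 80 (U+110000) are rejected, which is the distinction the documentation above draws.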

