
Re: SvPVutf8 validity

From: Felipe Gasper
Date: March 22, 2021 11:35
Subject: Re: SvPVutf8 validity
Message ID: EE03DF51-5389-4C82-945A-D84E7B23773C@felipegasper.com


> On Mar 22, 2021, at 7:23 AM, Tony Cook <tony@develop-help.com> wrote:
> 
> On Mon, Mar 22, 2021 at 06:54:00AM -0400, Felipe Gasper wrote:
>> 
>>> On Mar 21, 2021, at 11:20 PM, Tony Cook <tony@develop-help.com> wrote:
>>> 
>>> On Sun, Mar 21, 2021 at 11:02:25PM -0400, Felipe Gasper wrote:
>>>> Hello,
>>>> 
>>>> Does SvPVutf8 have the same UTF-8 validity problems as Encode::encode_utf8()?
>>> 
>>> It returns the internal UTF-8 encoding, which can include surrogates,
>>> etc.
>>> 
>>> If that's not what concerns you, please be more specific.
>> 
>> Yeah, that’s it: SvPVutf8 is Perl’s internal “lax” utf8 rather than official, valid UTF-8. So SvPVutf8 will happily encode code points that UTF-8 forbids, e.g., "\x{ffff}".
> 
> Unicode explicitly permits noncharacters for internal use and doesn't
> forbid them for interchange, from 23.7 Noncharacters:
> 
>  Applications are free to use any of these noncharacter code points
>  internally. They have no standard interpretation when exchanged
>  outside the context of internal use. However, they are not illegal in
>  interchange, nor does their presence cause Unicode text to be
>  ill-formed.
> 
> There is a "should not" for public exchange in 3.2 Conformance
> Requirements, but they aren't forbidden.

Ah, OK. Thank you. I hadn’t read the spec; I was going off the description of is_strict_utf8_string() in perlapi.

> 
>> It sounds, then, like XS modules that speak UTF-8 to external libraries should normally pass SvPVutf8’s output through is_strict_utf8_string() or a variant? So this would be a (documentation-worthy?) caveat of using SvPVutf8.
> 
> A bigger deal would be using "supers" or characters beyond U+10FFFF.
> 
> But I think this type of issue should be dealt with on input - don't
> allow these characters into your strings in the first place, and
> SvPVutf8() won't return their encoded forms.

Most discussions I see about character encoding in Perl treat Encode::encode_utf8() as improper, or at least less than ideal, because it outputs “lax” UTF-8. (Encode.pm’s own docs are one example.) Given that SvPVutf8 uses essentially the same encoding logic, would the same criticism apply to it?
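
To make the "lax" versus strict distinction concrete, here is a minimal sketch in plain C of the kind of check an XS module could run on the bytes it gets from SvPVutf8 before handing them to an external library. It is a standalone stand-in, not perl's implementation: the name utf8_is_interchange_safe and the reject_nonchars flag are hypothetical, and real XS code would more likely call is_strict_utf8_string() or is_c9strict_utf8_string() from perlapi. The sketch always refuses surrogates, overlong forms, and anything above U+10FFFF, and refuses noncharacters only when asked to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of perlapi. Returns true if s[0..len) is
 * well-formed UTF-8: no overlong forms, no surrogates, nothing above
 * U+10FFFF. If reject_nonchars is set, noncharacters such as U+FFFF are
 * refused too, even though Unicode does not call them ill-formed. */
static bool utf8_is_interchange_safe(const unsigned char *s, size_t len,
                                     bool reject_nonchars)
{
    assert(s != NULL || len == 0);

    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        uint32_t cp, min;
        size_t need;                      /* continuation bytes expected */

        if (b < 0x80) { i++; continue; }  /* ASCII */
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; need = 1; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; need = 2; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; need = 3; min = 0x10000; }
        else return false;  /* stray continuation byte, or a 5+ byte lead */

        if (len - i - 1 < need) return false;            /* truncated */

        for (size_t k = 1; k <= need; k++) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) return false;        /* not 10xxxxxx */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min) return false;                      /* overlong */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* surrogate */
        if (cp > 0x10FFFF) return false;                 /* above Unicode */
        if (reject_nonchars &&
            ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE))
            return false;                                /* noncharacter */

        i += need + 1;
    }
    return true;
}
```

With reject_nonchars false, the three bytes EF BF BF (that is, "\x{ffff}") pass, which matches Tony's reading of the standard; with it true they are refused. An encoded surrogate such as ED A0 80, or anything past U+10FFFF, fails either way.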

Thank you!

-F