perl.perl5.porters | Postings from March 2021

Re: SvPVutf8 validity

Felipe Gasper
March 22, 2021 11:35

> On Mar 22, 2021, at 7:23 AM, Tony Cook <> wrote:
> On Mon, Mar 22, 2021 at 06:54:00AM -0400, Felipe Gasper wrote:
>>> On Mar 21, 2021, at 11:20 PM, Tony Cook <> wrote:
>>> On Sun, Mar 21, 2021 at 11:02:25PM -0400, Felipe Gasper wrote:
>>>> Hello,
>>>> Does SvPVutf8 have the same UTF-8 validity problems as Encode::encode_utf8()?
>>> It returns the internal UTF-8 encoding, which can include surrogates,
>>> etc.
>>> If that's not what concerns you, please be more specific.
>> Yeah, that’s it: SvPVutf8 is Perl’s internal “lax” utf8 rather than official, valid UTF-8. So SvPVutf8 will happily encode code points that UTF-8 forbids, e.g., "\x{ffff}".
> Unicode explicitly permits noncharacters for internal use and doesn't
> forbid them for interchange, from 23.7 Noncharacters:
>  Applications are free to use any of these noncharacter code points
>  internally. They have no standard interpretation when exchanged
>  outside the context of internal use. However, they are not illegal in
>  interchange, nor does their presence cause Unicode text to be
>  ill-formed.
> There is a "should not" for public exchange in 3.2 Conformance
> Requirements, but they aren't forbidden.

Ah ok. Thank you--I hadn’t read the spec but was going off of is_strict_utf8_string()’s description in perlapi.

>> It sounds, then, like XS modules that speak UTF-8 to external libraries should normally pass SvPVutf8’s output through is_strict_utf8_string() or a variant? So this would be a (documentation-worthy?) caveat of using SvPVutf8.
> A bigger deal would be using "supers" or characters beyond U+10FFFF.
> But I think this type of issue should be dealt with on input - don't
> allow these characters into your strings in the first place, and
> SvPVutf8() won't return their encoded forms.

Most discussions I see about character encoding in Perl consider Encode::encode_utf8() to be improper/less-than-ideal because it outputs “lax” UTF-8. (’s own docs, for example.) Given that SvPVutf8 is essentially the same encoding logic, would the same propriety apply?
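
To make the distinction concrete, here is a minimal standalone sketch in C of the kind of post-check discussed above. It is not Perl's implementation and does not use perl.h; the names classify_code_point, is_strict_utf8, and allow_nonchars are hypothetical. In a real XS module one would presumably call perlapi's is_strict_utf8_string() on the buffer returned by SvPVutf8, or is_c9strict_utf8_string() if noncharacters should be accepted. The sketch rejects surrogates and code points above U+10FFFF, and treats noncharacters as a caller's choice:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Code point categories that a lax encoder may emit but that an
   external consumer may refuse. All names here are hypothetical. */
enum cp_class { CP_OK, CP_SURROGATE, CP_NONCHARACTER, CP_ABOVE_UNICODE };

static enum cp_class classify_code_point(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return CP_ABOVE_UNICODE;                 /* "supers" */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return CP_SURROGATE;
    /* U+FDD0..U+FDEF, plus the last two code points of every plane */
    if ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE)
        return CP_NONCHARACTER;
    return CP_OK;
}

/* Returns 1 if buf[0..len) is well-formed UTF-8 containing no surrogates
   and nothing above U+10FFFF. Noncharacters are accepted only when
   allow_nonchars is nonzero. */
static int is_strict_utf8(const unsigned char *buf, size_t len,
                          int allow_nonchars)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);

    while (i < len) {
        unsigned char b = buf[i];
        uint32_t cp, min;
        size_t need;                             /* continuation bytes */

        if (b < 0x80) { i++; continue; }         /* ASCII */
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; need = 1; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; need = 2; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; need = 3; min = 0x10000; }
        else return 0;          /* stray continuation or 5+ byte lead */

        if (len - i - 1 < need)
            return 0;                            /* truncated sequence */

        for (size_t k = 1; k <= need; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80)
                return 0;                        /* not a continuation */
            cp = (cp << 6) | (uint32_t)(c & 0x3F);
        }

        if (cp < min)
            return 0;                            /* overlong encoding */

        switch (classify_code_point(cp)) {
        case CP_OK:
            break;
        case CP_NONCHARACTER:
            if (!allow_nonchars)
                return 0;
            break;
        default:
            return 0;                            /* surrogate or super */
        }
        i += need + 1;
    }
    return 1;
}
```

With this sketch, the bytes EF BF BF (the encoding of "\x{ffff}") fail when allow_nonchars is 0 and pass when it is 1, while ED A0 80 (U+D800) and F4 90 80 80 (U+110000) fail either way.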

Thank you!
