perl.perl5.porters | Postings from March 2021

Re: SvPVutf8 validity

Tony Cook
March 22, 2021 11:24
On Mon, Mar 22, 2021 at 06:54:00AM -0400, Felipe Gasper wrote:
> > On Mar 21, 2021, at 11:20 PM, Tony Cook <> wrote:
> > 
> > On Sun, Mar 21, 2021 at 11:02:25PM -0400, Felipe Gasper wrote:
> >> Hello,
> >> 
> >> Does SvPVutf8 have the same UTF-8 validity problems as Encode::encode_utf8()?
> > 
> > It returns the internal UTF-8 encoding, which can include surrogates,
> > etc.
> > 
> > If that's not what concerns you, please be more specific.
> Yeah, that’s it: SvPVutf8 is Perl’s internal “lax” utf8 rather than official, valid UTF-8. So SvPVutf8 will happily encode code points that UTF-8 forbids, e.g., "\x{ffff}".

Unicode explicitly permits noncharacters for internal use and doesn't
forbid them in interchange. From section 23.7, Noncharacters:

  Applications are free to use any of these noncharacter code points
  internally. They have no standard interpretation when exchanged
  outside the context of internal use. However, they are not illegal in
  interchange, nor does their presence cause Unicode text to be
  ill-formed.

There is a "should not" for public exchange in 3.2 Conformance
Requirements, but they aren't forbidden.
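
To make the category concrete, here is a minimal standalone C sketch of
which code points count as noncharacters: U+FDD0..U+FDEF, plus the last
two code points of each of the 17 planes. The function name
is_noncharacter is hypothetical and used only for illustration; this is
not perl's own implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * Returns true if cp is one of the 66 Unicode noncharacters:
 *   - U+FDD0..U+FDEF (32 code points), or
 *   - U+nFFFE / U+nFFFF for each plane n = 0..16 (34 code points). */
static bool is_noncharacter(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return false;               /* above Unicode: a different category */
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return true;
    /* Bits 1..15 of the low 16 bits all set: low half is FFFE or FFFF. */
    return (cp & 0xFFFE) == 0xFFFE;
}
```

With this definition U+FFFF (the example quoted above) is a noncharacter,
while U+FFFD, the replacement character, is not.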

> It sounds, then, like XS modules that speak UTF-8 to external libraries should normally pass SvPVutf8’s output through is_strict_utf8_string() or a variant? So this would be a (documentation-worthy?) caveat of using SvPVutf8.

A bigger deal would be using "supers", i.e. code points beyond U+10FFFF.

But I think this type of issue should be dealt with on input - don't
allow these characters into your strings in the first place, and
SvPVutf8() won't return their encoded forms.
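
As a rough illustration of what checking at a boundary involves, here is
a standalone C sketch that accepts only well-formed standard UTF-8 and
rejects surrogates and anything beyond U+10FFFF, with noncharacters
optionally allowed. The names cp_is_nonchar and is_interchange_safe_utf8
are hypothetical; inside XS code one would more likely call
is_strict_utf8_string(), mentioned earlier in this thread, than write
this by hand. It is a sketch under those assumptions, not perl's
implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true for the 66 Unicode noncharacters. */
static bool cp_is_nonchar(uint32_t cp)
{
    if (cp > 0x10FFFF) return false;
    if (cp >= 0xFDD0 && cp <= 0xFDEF) return true;
    return (cp & 0xFFFE) == 0xFFFE;
}

/* Hypothetical validator: returns true if s[0..len) is well-formed
 * UTF-8 with no surrogates and nothing above U+10FFFF. Noncharacters
 * are rejected unless allow_nonchars is true. */
static bool is_interchange_safe_utf8(const unsigned char *s, size_t len,
                                     bool allow_nonchars)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        uint32_t cp, min;
        size_t n;                       /* continuation bytes expected */

        if (b < 0x80)                    { cp = b;        n = 0; min = 0; }
        else if (b >= 0xC2 && b <= 0xDF) { cp = b & 0x1F; n = 1; min = 0x80; }
        else if ((b & 0xF0) == 0xE0)     { cp = b & 0x0F; n = 2; min = 0x800; }
        else if (b >= 0xF0 && b <= 0xF4) { cp = b & 0x07; n = 3; min = 0x10000; }
        else return false;  /* stray continuation, C0/C1, or F5 and up */

        if (len - i - 1 < n) return false;          /* truncated sequence */
        for (size_t k = 1; k <= n; k++) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) return false;   /* not a continuation */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                       /* overlong */
        if (cp > 0x10FFFF) return false;                  /* above Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   /* surrogate */
        if (!allow_nonchars && cp_is_nonchar(cp)) return false;
        i += n + 1;
    }
    return true;
}
```

Passing allow_nonchars as true reflects the point above that
noncharacters are not ill-formed; passing false gives the stricter
behaviour for libraries that refuse them.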

