perl.perl5.porters | Postings from March 2021

Re: Perl 7: Fix string leaks?

Felipe Gasper
March 30, 2021 10:45

> On Mar 30, 2021, at 3:20 AM, Dan Book <> wrote:
>
> On Tue, Mar 30, 2021 at 3:07 AM Yuki Kimoto <> wrote:
>> I still don't understand this problem.
>> Is 128-255(latin-1 and UTF-8 shared) range problem?
>
> More specifically the problem is that those codepoints are both used by encodings such as latin-1 and UTF-8 bytes, and also are valid unicode codepoints (which are mostly the same as the characters they encode as latin-1 bytes). This, with Perl's string design, means the program cannot know whether a string intends to contain decoded text or encoded bytes.

To flesh this out a bit:

Consider the following XSUB:

void
printit (char * thestring)
    CODE:
        fprintf(stdout, "%s", thestring);

Now consider the following Perl:

my $str_a = "\xff";
printit($str_a);

It’ll print one byte, right? Now consider this Perl:

my $str_b = "\xff\x{100}";
chop $str_b;
printit($str_b);

That will print TWO bytes (0xc3 0xbf, the UTF-8 encoding of U+00FF), even though $str_b eq $str_a is true.

The problem is that `char *` in XS’s default typemap uses SvPV. Ordinarily this is fine as long as you check the SV’s SvUTF8 flag, but the typemap doesn’t do that, so our C code has no way to know whether the bytes it receives are Latin-1 or UTF-8, and thus which characters the original Perl string contains. In this case, $str_b’s PV is UTF-8, while $str_a’s is Latin-1, which means that, although both Perl strings contain only code point 255, SvPV will yield different C strings for them.
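
A rough way to see the difference from pure Perl, without compiling any XS. This is only a sketch: pv_info() is a made-up helper name, and it uses utf8::is_utf8() and bytes::length() as stand-ins for what C code would learn from SvUTF8 and the PV length.

```perl
use strict;
use warnings;
use bytes ();   # loads bytes::length() without enabling byte semantics

# Hypothetical helper: report the SvUTF8 flag and the size of the
# internal PV buffer, i.e. what an SvPV-based typemap would hand to C.
sub pv_info {
    my ($str) = @_;
    return (utf8::is_utf8($str) ? 1 : 0, bytes::length($str));
}

my $str_a = "\xff";
my $str_b = "\xff\x{100}";
chop $str_b;

# Perl code sees the same one-character string both times ...
print "equal: ", ($str_a eq $str_b ? "yes" : "no"), "\n";

# ... but the internal buffers differ.
printf "str_a: utf8_flag=%d pv_bytes=%d\n", pv_info($str_a);
printf "str_b: utf8_flag=%d pv_bytes=%d\n", pv_info($str_b);
```

On a typical perl this should report the strings as equal, with one PV byte and no UTF-8 flag for $str_a, and two PV bytes with the flag set for $str_b.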

The fix here is to switch the typemap to SvPVbyte so that identical Perl strings will yield identical C representations.
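
To get a feel for what that change would mean, here is a Perl-level approximation. It assumes utf8::downgrade() on a copy is a fair stand-in for SvPVbyte (both refuse strings with code points above 255); bytes_like_svpvbyte() is a name invented for this sketch, not anything in core.

```perl
use strict;
use warnings;
use bytes ();

# Hypothetical stand-in for SvPVbyte: downgrade a copy of the string and
# return it. utf8::downgrade() dies if any code point is above 255.
sub bytes_like_svpvbyte {
    my ($str) = @_;          # copy, so the caller's scalar is untouched
    utf8::downgrade($str);
    return $str;
}

my $str_a = "\xff";
my $str_b = "\xff\x{100}";
chop $str_b;

# Both strings now yield the same single byte.
for my $s ($str_a, $str_b) {
    printf "%d byte(s)\n", bytes::length(bytes_like_svpvbyte($s));
}

# A string that cannot be represented as bytes is rejected outright.
my $ok = eval { bytes_like_svpvbyte("\x{100}"); 1 };
print "wide string: ", ($ok ? "accepted" : "rejected"), "\n";
```

Identical Perl strings give identical bytes here, at the cost of a hard failure for anything outside the 0-255 range.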

That was a somewhat contrived example; a more realistic one might be:

my $str_c = Encode::Simple::decode_utf8('ÆÆÆ');
printit($str_c);

Here, printit() does something people might realistically expect: its use of the internal PV effects a de facto “auto-UTF-8-encode”. If the default typemap were SvPVbyte here, printit() would instead emit Latin-1 bytes (or throw an exception, for a string containing any code point above 255), thus breaking the application. The Perl application *should* encode_utf8() for itself, but lots of Perl code out there probably depends on this auto-encode behaviour.
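
For comparison, a sketch of the explicit encode that the application should be doing. It uses core Encode rather than Encode::Simple so that it runs on a stock perl, and to_utf8_bytes() is just an illustrative name.

```perl
use strict;
use warnings;
use Encode ();

# Illustrative helper: turn decoded text into UTF-8 bytes explicitly,
# instead of relying on whatever the internal PV happens to contain.
sub to_utf8_bytes {
    my ($text) = @_;
    return Encode::encode('UTF-8', $text);
}

# Three times U+00C6, the characters decode_utf8('ÆÆÆ') returns
# when the source file is UTF-8.
my $str_c = "\xc6\xc6\xc6";

my $upgraded = $str_c;
utf8::upgrade($upgraded);    # same characters, UTF-8 internal buffer

# The encoded bytes are the same whichever internal form the text had.
for my $s ($str_c, $upgraded) {
    print unpack("H*", to_utf8_bytes($s)), "\n";
}
```

Both lines should read c386c386c386, so the bytes handed to something like printit() no longer depend on the string’s internal representation.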

The same bug exists if you mkdir($str_a) and mkdir($str_b). See Sys::Binmode for a CPAN fix. I hope to convince enough folks here to bring that into core eventually.
