develooper Front page | perl.perl5.porters | Postings from March 2021

Re: Perl 7: Fix string leaks?

From: Felipe Gasper
Date: March 30, 2021 10:45
Subject: Re: Perl 7: Fix string leaks?
Message ID: 783B6DD4-AF25-450E-844E-59B0A59FA1C9@felipegasper.com

> On Mar 30, 2021, at 3:20 AM, Dan Book <grinnz@gmail.com> wrote:
> 
> On Tue, Mar 30, 2021 at 3:07 AM Yuki Kimoto <kimoto.yuki@gmail.com> wrote:
> > I still don't understand this problem.
> >
> > Is 128-255(latin-1 and UTF-8 shared) range problem?
> 
> More specifically the problem is that those codepoints are both used by encodings such as latin-1 and UTF-8 bytes, and also are valid unicode codepoints (which are mostly the same as the characters they encode as latin-1 bytes). This, with Perl's string design, means the program cannot know whether a string intends to contain decoded text or encoded bytes.

To flesh this out a bit:

Consider the following XSUB:

void
printit (char * thestring)
  CODE:
    fprintf(stdout, "%s", thestring);

Now consider the following Perl:

my $str_a = "\xff";
printit($str_a);

It’ll print one byte, right? Now consider this Perl:

my $str_b = "\xff\x{100}";
chop $str_b;
printit($str_b);

That will print TWO bytes, even though $str_b eq $str_a is true.

The problem is that `char *` in XS’s default typemap uses SvPV. Ordinarily this is fine as long as you check the SV’s SvUTF8 flag, but the typemap doesn’t do that, so our C code has no way to know how many actual characters the original Perl string contains. In this case, $str_b’s PV is UTF-8 while $str_a’s is Latin-1, so although both Perl strings contain only code point 255, SvPV yields different C strings for them.
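For anyone who wants to see the difference without compiling any XS, here is a minimal pure-Perl sketch. It assumes that bytes::length() (from the core bytes pragma) is a fair stand-in for the length of the buffer that SvPV hands to C. The variable names other than $str_a and $str_b are only for illustration.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length() without enabling the pragma

my $str_a = "\xff";

my $str_b = "\xff\x{100}";
chop $str_b;

# As far as Perl is concerned, the two strings hold the same characters.
my $same_chars = $str_a eq $str_b ? 1 : 0;

# Internally they differ: only $str_b carries the UTF8 flag.
my $a_is_utf8 = utf8::is_utf8($str_a) ? 1 : 0;
my $b_is_utf8 = utf8::is_utf8($str_b) ? 1 : 0;

# Length of the internal buffer, used here as a proxy for what SvPV exposes.
my $a_pv_len = bytes::length($str_a);
my $b_pv_len = bytes::length($str_b);

printf "eq=%d utf8(a)=%d utf8(b)=%d pv_len(a)=%d pv_len(b)=%d\n",
    $same_chars, $a_is_utf8, $b_is_utf8, $a_pv_len, $b_pv_len;
# prints: eq=1 utf8(a)=0 utf8(b)=1 pv_len(a)=1 pv_len(b)=2
```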

The fix here is to switch the typemap to SvPVbyte so that identical Perl strings will yield identical C representations.
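A rough pure-Perl emulation of what an SvPVbyte-based typemap would hand to C may help here. It assumes that utf8::downgrade() on a copy mirrors the downgrade SvPVbyte performs. as_pvbyte() is a hypothetical helper written for this sketch, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: return the bytes a byte-oriented typemap would see,
# or die if the string holds a code point above 255.
sub as_pvbyte {
    my ($str) = @_;             # copy, so the caller's scalar is untouched
    utf8::downgrade($str, 1)    # second arg: return false instead of dying
        or die "Wide character in as_pvbyte\n";
    return $str;
}

my $str_a = "\xff";

my $str_b = "\xff\x{100}";
chop $str_b;

my $bytes_a = as_pvbyte($str_a);
my $bytes_b = as_pvbyte($str_b);

# Both strings now have the same single-byte representation.
printf "%s %s\n", unpack('H*', $bytes_a), unpack('H*', $bytes_b);
# prints: ff ff

# A string that cannot be represented as bytes is rejected outright.
my $wide_ok = eval { as_pvbyte("\x{100}"); 1 } ? 1 : 0;
print "wide string accepted: $wide_ok\n";
# prints: wide string accepted: 0
```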

That was a somewhat contrived example; a more realistic one might be:

my $str_c = Encode::Simple::decode_utf8('ÆÆÆ');
printit($str_c);

Here, printit() does something people might realistically expect: its use of the internal PV effects a de facto “auto-UTF-8-encode”. If the default typemap were SvPVbyte here, printit() would instead receive the three Latin-1 bytes 0xC6 0xC6 0xC6, and it would throw an exception for any string containing a code point above 255; either way the application breaks. The Perl application *should* encode_utf8() for itself, but lots of Perl code out there probably depends on this auto-encode behaviour.
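As a sketch of what “encode for itself” looks like, the following uses the core Encode module rather than Encode::Simple, purely so that it runs without CPAN dependencies. The point it tries to show is that an explicit encode gives the same bytes no matter how Perl happens to store the string internally. Variable names are illustrative.

```perl
use strict;
use warnings;
use Encode ();

# Three U+00C6 characters, decoded from their UTF-8 bytes.
my $utf8_bytes = "\xc3\x86" x 3;
my $str_c      = Encode::decode('UTF-8', $utf8_bytes);

# Same characters, but forced into Latin-1 internal storage.
my $downgraded = $str_c;
utf8::downgrade($downgraded);

# Explicit encoding does not depend on the internal representation.
my $encoded  = Encode::encode('UTF-8', $str_c);
my $encoded2 = Encode::encode('UTF-8', $downgraded);

printf "chars=%d bytes=%d hex=%s same=%d\n",
    length($str_c), length($encoded), unpack('H*', $encoded),
    ($encoded eq $encoded2 ? 1 : 0);
# prints: chars=3 bytes=6 hex=c386c386c386 same=1
```

Passing $encoded to a byte-oriented function is then safe whether that function reads its argument via SvPV or SvPVbyte.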


The same bug exists if you mkdir($str_a) and mkdir($str_b). See Sys::Binmode for a CPAN fix. I hope to convince enough folks here to bring that into core eventually.


-F