
Re: Perl 7: Fix string leaks?

From: Yuki Kimoto
Date: March 31, 2021 06:33
Subject: Re: Perl 7: Fix string leaks?
Message ID: CAExogxMnP3wNebLWdq054r3Z-idbG7a6U6mkfpjAUOwfNvT+uA@mail.gmail.com

Felipe Gasper,

Thank you. I am starting to understand this, little by little.


On Tue, Mar 30, 2021 at 19:45, Felipe Gasper <felipe@felipegasper.com> wrote:

>
> > On Mar 30, 2021, at 3:20 AM, Dan Book <grinnz@gmail.com> wrote:
> >
> > On Tue, Mar 30, 2021 at 3:07 AM Yuki Kimoto <kimoto.yuki@gmail.com>
> wrote:
> > I still don't understand this problem.
> >
> > Is it a problem with the 128-255 range (shared by latin-1 and UTF-8)?
> >
> > More specifically the problem is that those codepoints are both used by
> encodings such as latin-1 and UTF-8 bytes, and also are valid unicode
> codepoints (which are mostly the same as the characters they encode as
> latin-1 bytes). This, with Perl's string design, means the program cannot
> know whether a string intends to contain decoded text or encoded bytes.
>
> To flesh this out a bit:
>
> Consider the following XSUB:
>
> void
> printit (char * thestring)
>   CODE:
>     fprintf(stdout, "%s", thestring);
>
> Now consider the following Perl:
>
> my $str_a = "\xff";
> printit($str_a);
>
> It’ll print one byte, right? Now consider this Perl:
>
> my $str_b = "\xff\x{100}";
> chop $str_b;
> printit($str_b);
>
> That will print TWO bytes, even though $str_b eq $str_a is true.
>
> The problem is that `char *` in XS’s default typemap uses SvPV. Ordinarily
> this is fine as long as you check the SV’s SvUTF8, but the typemap doesn’t
> do that, so our C code has no way to know how many actual characters the
> original Perl string contains. In this case, $str_b’s PV is UTF-8, while
> $str_a is Latin-1, which means that, although both Perl strings contain
> code point 255, SvPV will yield different C strings for them.
>
> The fix here is to switch the typemap to SvPVbyte so that identical Perl
> strings will yield identical C representation.
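
To make the difference in C representation concrete, here is a minimal sketch in plain C, with no Perl headers. store_latin1 and store_utf8 are hypothetical helpers, not Perl API; they only mimic the two ways a PV may hold the same code point, and store_utf8 is deliberately limited to code points below 0x800.

```c
#include <stddef.h>

/* Hypothetical helper: store a code point the way a non-UTF-8 PV would,
   one byte per character. Returns bytes written, or 0 if cp > 0xFF. */
size_t store_latin1(unsigned cp, unsigned char *out)
{
    if (cp > 0xFF)
        return 0;
    out[0] = (unsigned char)cp;
    return 1;
}

/* Hypothetical helper: store a code point the way a UTF-8 PV would.
   Only handles cp < 0x800 (one or two bytes); returns 0 otherwise. */
size_t store_utf8(unsigned cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));   /* lead byte */
        out[1] = (unsigned char)(0x80 | (cp & 0x3F)); /* continuation byte */
        return 2;
    }
    return 0;
}
```

For code point 0xFF, store_latin1 writes the single byte FF, while store_utf8 writes the two bytes C3 BF. A `char *` consumer that is not told which form it received sees two different C strings for the same Perl string.
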
>
> That was a somewhat contrived example; a more realistic one might be:
>
> my $str_c = Encode::Simple::decode_utf8('ÆÆÆ');
> printit($str_c);
>
> Here, printit() does something people might realistically expect: its use
> of the internal PV effects a de facto “auto-UTF-8-encode”. If the default
> typemap were SvPVbyte here, printit() would instead receive the downgraded
> Latin-1 bytes (or throw an exception for any code point above 255), thus
> breaking the application. The Perl application *should* encode_utf8() for
> itself, but lots of Perl code out there probably depends on this
> auto-encode behaviour.
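
As a rough model of what a byte-oriented typemap has to do with a UTF-8 PV, the following C sketch converts UTF-8 input to one byte per character. downgrade_utf8 is a hypothetical stand-in, not the real Perl function; it assumes well-formed input and returns -1 where SvPVbyte would croak with a "Wide character" error.

```c
#include <stddef.h>

/* Hypothetical stand-in for the downgrade step: copy UTF-8 input to `out`
   as one byte per character. Returns the number of bytes written, or -1
   if a character above 255 (or a truncated sequence) is found.
   Assumes otherwise well-formed UTF-8; `out` must hold at least `len` bytes. */
long downgrade_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t i = 0;
    long n = 0;

    while (i < len) {
        unsigned char b = in[i];

        if (b < 0x80) {
            /* ASCII is a single byte in both representations. */
            out[n++] = b;
            i += 1;
        } else if ((b == 0xC2 || b == 0xC3) && i + 1 < len) {
            /* Lead bytes C2 and C3 cover exactly U+0080..U+00FF. */
            out[n++] = (unsigned char)(((b & 0x03) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        } else {
            /* Any other lead byte means a code point above 255. */
            return -1;
        }
    }
    return n;
}
```

In this model the two UTF-8 bytes C3 86 (Æ, U+00C6) come out as the single byte C6, so the C code no longer sees UTF-8, and the sequence C4 80 (U+0100) is rejected outright. Either way, code that relied on receiving the UTF-8 bytes changes behaviour.
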
>
>
> The same bug exists if you mkdir($str_a) and mkdir($str_b). See
> Sys::Binmode for a CPAN fix. I hope to convince enough folks here to bring
> that into core eventually.
>
>
> -F
