Re: Perl 7: Fix string leaks?
March 30, 2021 20:48
Message ID: email@example.com
On 30/3/21 20:53, Felipe Gasper wrote:
>> On Mar 30, 2021, at 12:16 PM, Salvador Fandiño <firstname.lastname@example.org> wrote:
>> On 30/3/21 16:39, Felipe Gasper wrote:
>>>> On Mar 30, 2021, at 10:26 AM, Salvador Fandiño <email@example.com> wrote:
>>>> On 30/3/21 12:45, Felipe Gasper wrote:
>>>>> The fix here is to switch the typemap to SvPVbyte so that identical Perl strings will yield identical C representation.
>>>> Any XS code that is using SvPV to convert SVs to char* is already broken.
>>>> IMO, the default typemap could be changed right now.
>>> Wouldn’t that break a great many applications which currently pass decoded strings to XSUBs?
>> You mean ensuring from the Perl side that any SV has the UTF8 flag set before passing it to some XSUB, right?
> Not quite.
> From your comments it sounds like you would map `char *` to `SvPVutf8_nolen`. That would break apps that either pre-encode their strings before giving them to XSUBs or that skip character decoding.
No, no, I have not explained myself well. You said "decode", and for
me, in the Perl context, that means utf-8.
Anyway, what I mean is that the XS programmer should actually choose
between SvPVutf8 or SvPVbyte (or any of the variations). The first makes
sense when the XSUB requires utf-8 encoded data, the second one when it
requires binary data or data in some other encoding (then the developer
needs to do the decoding himself in some way).
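To make the first choice concrete, here is a minimal sketch in plain C.
It is not real XS: toy_sv and toy_pv_utf8 are made-up names that only
imitate the effect of SvPVutf8 on a scalar that is stored either as
latin1 (flag clear) or as utf-8 (flag set). The struct is an assumption
for illustration, not the real SV layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a Perl scalar (hypothetical, not the real SV):
 * a byte buffer plus a flag saying whether the buffer holds UTF-8
 * (flag set) or one byte per code point, i.e. latin1 (flag clear). */
typedef struct {
    const unsigned char *buf;
    size_t len;
    int is_utf8;
} toy_sv;

/* Imitates the effect of SvPVutf8: always hand back UTF-8 bytes,
 * whatever the internal storage is. Returns a malloc'ed,
 * NUL-terminated buffer (caller frees) and stores its length in
 * *out_len, or returns NULL if the allocation fails. */
static unsigned char *toy_pv_utf8(const toy_sv *sv, size_t *out_len)
{
    assert(sv != NULL && out_len != NULL);

    /* Worst case: every latin1 byte expands to two UTF-8 bytes. */
    unsigned char *out = malloc(sv->len * 2 + 1);
    size_t n = 0;

    if (out == NULL)
        return NULL;

    if (sv->is_utf8) {
        /* Already UTF-8: copy as is. */
        memcpy(out, sv->buf, sv->len);
        n = sv->len;
    } else {
        /* latin1: bytes >= 0x80 become two-byte sequences. */
        for (size_t i = 0; i < sv->len; i++) {
            unsigned char c = sv->buf[i];
            if (c < 0x80) {
                out[n++] = c;
            } else {
                out[n++] = (unsigned char)(0xC0 | (c >> 6));
                out[n++] = (unsigned char)(0x80 | (c & 0x3F));
            }
        }
    }
    out[n] = '\0';
    *out_len = n;
    return out;
}
```

With an accessor like this, the C side sees the same bytes for the same
Perl string, independently of how perl happens to store it.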
> Alternatively, if you mean--as I propose--making `char *` map to `SvPVbyte_nolen`, that will break apps that *don’t* pre-encode their strings.
Yes, but my point is that those are already broken in almost all cases.
If you call SvPV on an SV outside your control, you can get back a
string encoded as latin1 or the same string encoded as utf-8.
The only case where that doesn't happen is when the developer explicitly
calls utf8::upgrade every time just before calling the XSUB, or uses
some other way to ensure that the UTF8 flag is always set.
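The problem can be shown with a toy model in plain C. Again, this is
not real XS: toy_sv, toy_pv_raw and naive_xsub_equal are made-up names,
and toy_pv_raw only imitates what a bare SvPV does, which is to hand
back the internal buffer without looking at the UTF8 flag.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a Perl scalar (hypothetical, not the real SV). */
typedef struct {
    const unsigned char *buf;
    size_t len;
    int is_utf8;   /* set: buf is UTF-8; clear: buf is latin1 */
} toy_sv;

/* Imitates a bare SvPV: returns the internal buffer untouched and
 * ignores the flag entirely. */
static const unsigned char *toy_pv_raw(const toy_sv *sv, size_t *out_len)
{
    assert(sv != NULL && out_len != NULL);
    *out_len = sv->len;
    return sv->buf;
}

/* What an XSUB that never checks the flag effectively does when it
 * compares two strings: it compares the raw bytes. */
static int naive_xsub_equal(const toy_sv *a, const toy_sv *b)
{
    size_t la, lb;
    const unsigned char *pa = toy_pv_raw(a, &la);
    const unsigned char *pb = toy_pv_raw(b, &lb);
    return la == lb && memcmp(pa, pb, la) == 0;
}

/* The same four-character string "caf\xE9", stored both ways. */
static const unsigned char CAFE_LATIN1[] = { 'c', 'a', 'f', 0xE9 };
static const unsigned char CAFE_UTF8[]   = { 'c', 'a', 'f', 0xC3, 0xA9 };
```

Two scalars built from CAFE_LATIN1 (flag clear) and CAFE_UTF8 (flag
set) are equal from the Perl side, but naive_xsub_equal reports them as
different, because the raw buffers differ in both length and content.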
> The status quo--SvPV_nolen--kind of serves both use cases, but unreliably so: it’s possible for a decoded string to be downgraded, and it’s possible for an encoded string to be upgraded. In either of those cases, SvPV will probably not yield the desired C string.
Yes, that's why I am also saying that it should not be used as the
default typemap for char*.
I agree with you that the default should be SvPVbyte.
>> No solution is trivial or evident, and would have required investigation from the developer. So, I would expect most people did find about 2 and used it.
> A lot of XS modules use SvPV without checking SvUTF8. Alas.
Yes, and almost all of them are broken!
>> Also, if you make the default typemap croak if the data can not be encoded, that would make any broken code very easy to detect. Any programmer which have adopted solution 1, would find pretty soon that something is broken in his code.
> SvPVbyte will make it easy to see what’s broken, sure, but someone will still need to go in and fix it. That “someone” could be the programmer who hates Perl and keeps hounding management to approve a rewrite in some other more popular language. If it suddenly breaks, that case is much easier to make.
Yes that's true, but they were already broken in the first place. You
are only making that more visible.
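As an illustration of "more visible", here is a sketch in plain C of
how a byte accessor behaves. It is not real XS: toy_sv and toy_pv_byte
are made-up names that imitate SvPVbyte. Where the real thing croaks
(with a "Wide character" error, as far as I recall), the toy version
just returns -1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a Perl scalar (hypothetical, not the real SV). */
typedef struct {
    const unsigned char *buf;
    size_t len;
    int is_utf8;   /* set: buf is UTF-8; clear: buf is latin1 */
} toy_sv;

/* Imitates SvPVbyte: produce one byte per code point.
 * Returns 0 on success and stores the length in *out_len.
 * Returns -1 if the string cannot be represented that way (a code
 * point above 0xFF or malformed UTF-8) or if cap is too small.
 * The output is never longer than sv->len. */
static int toy_pv_byte(const toy_sv *sv, unsigned char *out,
                       size_t cap, size_t *out_len)
{
    assert(sv != NULL && out != NULL && out_len != NULL);

    if (cap < sv->len)
        return -1;

    if (!sv->is_utf8) {
        /* Already one byte per code point: copy as is. */
        memcpy(out, sv->buf, sv->len);
        *out_len = sv->len;
        return 0;
    }

    size_t n = 0;
    for (size_t i = 0; i < sv->len; i++) {
        unsigned char c = sv->buf[i];
        if (c < 0x80) {
            out[n++] = c;
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < sv->len
                   && (sv->buf[i + 1] & 0xC0) == 0x80) {
            /* Two-byte sequence for a code point in 0x80..0xFF. */
            out[n++] = (unsigned char)(((c & 0x03) << 6)
                                       | (sv->buf[i + 1] & 0x3F));
            i++;
        } else {
            /* Code point above 0xFF or malformed input: refuse
             * instead of silently returning the wrong bytes. */
            return -1;
        }
    }
    *out_len = n;
    return 0;
}
```

The failure happens at the XS boundary, on the first call with data
that does not fit, instead of showing up later as corrupted output.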
That same Perl hater could also downplay Perl by saying that it is not
able to handle UTF-8 encoded data properly. Only this time, instead of
perl telling him that something is wrong in his program as soon as it
is detected, it is hiding the problem and turning it into an obscure
and annoying bug that is very difficult to troubleshoot.
That's no way to make friends either!
>> Encoding/decoding should be done at the boundaries (syscall, XS, etc.) using sensible defaults and/or allowing the user to set them (as in PerlIO). That would IMO fix most of the problems.
> “Encoding at the boundaries” is essentially what Sys::Binmode achieves for POSIX OSes, FWIW.
Yeah, that part is right! What is wrong is that it uses the wrong
encoding (latin1) most of the time!!!
Actually, I think it would be pretty easy to modify your module so that
it gets the encoding from the environment or, alternatively, takes it
as a parameter, at least on Unix and the like:
use Sys::Binmode "latin5";
Windows is a different story...