
Re: Perl 7: Fix string leaks? Perl5 Porters <perl5-porters@perl.org>

From:
Salvador Fandiño
Date:
March 30, 2021 16:16
Subject:
Re: Perl 7: Fix string leaks? Perl5 Porters <perl5-porters@perl.org>
Message ID:
4b7fb307-1297-7a30-c6ec-38767e1dfa84@gmail.com
On 30/3/21 16:39, Felipe Gasper wrote:
> 
>> On Mar 30, 2021, at 10:26 AM, Salvador Fandiño <sfandino@gmail.com> wrote:
>>
>> On 30/3/21 12:45, Felipe Gasper wrote:
>>> The fix here is to switch the typemap to SvPVbyte so that identical Perl strings will yield identical C representation.
>>
>> Any XS code that is using SvPV to convert SVs to char* is already broken.
>>
>> IMO, the default typemap could be changed right now.
> 
> Wouldn’t that break a great many applications which currently pass decoded strings to XSUBs?

You mean ensuring from the Perl side that any SV has the UTF8 flag set 
before passing it to some XSUB, right?

I guess that is not a very common case.

An XS author facing the issue of passing data to C code as UTF8 has two 
options:

1) Use utf8::upgrade() on the data every time before calling the XSUB.

2) Create a new typemap using SvPVutf8() - the right solution.

Neither solution is trivial or obvious, and either would have required 
some investigation by the developer. So I would expect that most people 
who looked into it found out about 2 and used it.

On the other hand, I am sure the far more common case is that of 
programmers who simply didn't take into account that the data may be 
UTF8 encoded. Their code is buggy and they don't know it.
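
To make that silent bug concrete, here is a minimal sketch in plain C (not 
the Perl API). It assumes a simplified model in which a string is stored 
either as Latin-1 bytes or as UTF-8, which is roughly what the UTF8 flag 
distinguishes. The names latin1_to_utf8 and same_bytes are hypothetical: 
the first stands in for an upgrade, the second for C code that compares 
the raw buffers it is handed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an "upgrade": re-encode Latin-1 bytes as UTF-8.
 * `out` must have room for 2 * len bytes. Returns the number of bytes
 * written. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[n++] = in[i];                                  /* ASCII: unchanged */
        } else {
            out[n++] = (unsigned char)(0xC0 | (in[i] >> 6));   /* lead byte */
            out[n++] = (unsigned char)(0x80 | (in[i] & 0x3F)); /* continuation */
        }
    }
    return n;
}

/* What naive C code sees: raw buffers, with no idea how they were encoded. */
int same_bytes(const unsigned char *a, size_t alen,
               const unsigned char *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}
```

Given the four Latin-1 bytes of "café", the upgraded copy is five bytes 
long (0xE9 becomes 0xC3 0xA9), so same_bytes reports the two buffers as 
different even though Perl considers the two strings equal.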

So, in the end, if you change the default typemap, you may break the 
code of some people who didn't find the right solution, while fixing the 
code of almost everybody else.

Also, if you make the default typemap croak when the data cannot be 
encoded, any broken code becomes very easy to detect. Any programmer 
who has adopted solution 1 would find out pretty soon that something 
is broken in their code.
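
The croaking behaviour can be sketched in plain C as a downgrade that 
reports failure instead of passing internal bytes through. This is an 
illustrative model, not the Perl API: utf8_to_latin1 is a hypothetical 
name, and it returns -1 at the point where a typemap would croak.

```c
#include <stddef.h>

/* Hypothetical stand-in for a byte-oriented typemap: decode UTF-8 and store
 * each code point as a single byte. Returns the number of bytes written, or
 * -1 if the input holds a code point above 0xFF or a malformed sequence
 * (the place where a real typemap would croak).
 * `out` must have room for len bytes. */
long utf8_to_latin1(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            out[n++] = c;                       /* ASCII: unchanged */
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < len
                   && (in[i + 1] & 0xC0) == 0x80) {
            /* Two-byte sequence encoding U+0080..U+00FF */
            out[n++] = (unsigned char)(((c & 0x03) << 6) | (in[i + 1] & 0x3F));
            i++;                                /* skip the continuation byte */
        } else {
            return -1;                          /* wide character or malformed */
        }
    }
    return (long)n;
}
```

Text that fits in bytes comes back the same however it was stored, and 
text that does not fit (for example U+20AC, the euro sign) fails loudly 
at the boundary instead of leaking its internal representation into C.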


>>> The same bug exists if you mkdir($str_a) and mkdir($str_b). See Sys::Binmode for a CPAN fix. I hope to convince enough folks here to bring that into core eventually.
>>
>> If I understand what Sys::Binmode does correctly, it just ensures that any string going to be used in a system call is first downgraded to
>> bytes, and so, effectively encoding it as latin1 or whatever it is the default code page in Windows.
>>
>> I don't think that really tackles the real problem. It fixes some issues but it is not what Perl really needs in the long run.
>>
>> I don't see as acceptable either to ask the programmer to do the encoding/decoding explicitly in his code every time before/after some data crosses the OS interface.
>>
>> Other languages (Python, Ruby) try to guess the encoding from the environment in Linux/UNIX/OS X or use the default encoding, UTF-16, on Windows.
>>
>> That also has their issues as OSs don't check the validity of the given strings and so file systems can contain invalid sequences and Perl should be able to handle those anyway. But encodings like WTF8 can be used to overcome that.
> 
> I agree with much of this. Sys::Binmode is a “band-aid” that at least plugs the abstraction leak, but as you say, the ideal fix would be to teach Perl to encode for the OS automatically.
> 
> That speaks, though, to the necessity of teaching Perl to track which strings are decoded (i.e., are text) and which are not (i.e., are binary). As Dan indicated, that’s a tough problem to solve.

I don't think that is required in general. Byte values are a subset 
of Unicode code points, so a UTF8 string can represent a byte string as 
well (as is already done).

Encoding/decoding should be done at the boundaries (syscall, XS, etc.) 
using sensible defaults and/or allowing the user to set them (as in 
PerlIO). That would IMO fix most of the problems.

