
Re: Perl 7: Fix string leaks?

Felipe Gasper
March 30, 2021 14:39

> On Mar 30, 2021, at 10:26 AM, Salvador Fandiño <> wrote:
> On 30/3/21 12:45, Felipe Gasper wrote:
>> The fix here is to switch the typemap to SvPVbyte so that identical Perl strings will yield identical C representation.
> Any XS code that is using SvPV to convert SVs to char* is already broken.
> IMO, the default typemap could be changed right now.

Wouldn’t that break a great many applications which currently pass decoded strings to XSUBs?
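The leak is visible from pure Perl, too. A minimal demonstration (no XS required) of how two strings that compare equal can nonetheless carry different internal byte representations — the representation that an SvPV-based typemap would hand to C:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $bytes = "caf\xe9";      # U+00E9 stored as a single Latin-1 byte
my $chars = "caf\xe9";
utf8::upgrade($chars);      # same characters, now stored as internal UTF-8

print $bytes eq $chars ? "equal\n" : "not equal\n";   # prints "equal"
print length($bytes), " vs ", length($chars), "\n";   # prints "4 vs 4"

# To Perl these are one and the same string, yet an XSUB whose typemap
# calls SvPV sees 4 bytes ("caf\xe9") for $bytes but 5 bytes
# ("caf\xc3\xa9") for $chars. SvPVbyte would return the same 4 bytes for
# both. This same leak is why mkdir($bytes) and mkdir($chars) can create
# two different directories.
```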

>> The same bug exists if you mkdir($str_a) and mkdir($str_b). See Sys::Binmode for a CPAN fix. I hope to convince enough folks here to bring that into core eventually.
> If I understand what Sys::Binmode does correctly, it just ensures that any string about to be used in a system call is first downgraded to
> bytes, effectively encoding it as Latin-1 (or whatever the default code page is on Windows).
> I don't think that really tackles the real problem. It fixes some issues, but it is not what Perl really needs in the long run.
> Nor do I see it as acceptable to ask the programmer to encode/decode explicitly every time some data crosses the OS interface.
> Other languages (Python, Ruby) try to guess the encoding from the environment on Linux/UNIX/OS X, or use the default encoding, UTF-16, on Windows.
> That also has its issues: OSs don't check the validity of the given strings, so file systems can contain invalid sequences, and Perl should be able to handle those anyway. But encodings like WTF-8 can be used to overcome that.

I agree with much of this. Sys::Binmode is a “band-aid” that at least plugs the abstraction leak, but as you say, the ideal fix would be to teach Perl to encode for the OS automatically.

That speaks, though, to the necessity of teaching Perl to track which strings are decoded (i.e., are text) and which are not (i.e., are binary). As Dan indicated, that’s a tough problem to solve.

Toward that end, though …

It’s my understanding that there are unused bits in the SV. What if we used two of those to store an enum that records the decoded/encoded state, thus:

enum sv_string_type {
    SV_STRING_TYPE_UNKNOWN,  /* not yet classified */
    SV_STRING_TYPE_TEXT,     /* decoded */
    SV_STRING_TYPE_BINARY,   /* encoded */
    /* fourth value unused */
};
… and then some new core mechanism were aware of that enum and die()d on any attempt to double-encode or double-decode? So you’d have:

my $str = <STDIN>;    # SV_STRING_TYPE_UNKNOWN by default, configurable.

text::decode_utf8($str);   # sets SV_STRING_TYPE_TEXT

text::decode_utf8($str);   # oops! die()s

text::encode_utf8($str);   # sets SV_STRING_TYPE_BINARY

text::encode_utf8($str);   # oops! die()s

# Existing code, of course, decodes using something like:

$str = Encode::decode_utf8($str);

# $str is still SV_STRING_TYPE_BINARY, so new, text-aware Perl would need to
# set SV_STRING_TYPE_TEXT without actually decoding:

text::set($str, SV_STRING_TYPE_TEXT);

# Likewise with encode operations:

$str = Encode::encode_utf8($str);
text::set($str, SV_STRING_TYPE_BINARY);
Obviously there’d be a lot of text::set()’ing going on for a while, but a) it would all be optional, and b) applications that already exercise proper “Eternal Vigilance” are well-positioned for this already. Applications that mishandle it already would, of course, have a meaningless sv_string_type -- which is no less than what they have now.
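For illustration, here is a userland sketch of those semantics. The decode_utf8/encode_utf8/set names follow the proposal above, but the TrackedString wrapper object is purely hypothetical — the real proposal would store the enum in spare SV bits, not in a Perl-level object:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode ();

# Hypothetical prototype of the proposed decoded/encoded state tracking.
package TrackedString {
    use constant { UNKNOWN => 0, TEXT => 1, BINARY => 2 };

    sub new {
        my ($class, $str) = @_;
        return bless { str => $str, type => UNKNOWN }, $class;
    }

    sub decode_utf8 {
        my ($self) = @_;
        die "double decode!\n" if $self->{type} == TEXT;
        $self->{str}  = Encode::decode('UTF-8', $self->{str});
        $self->{type} = TEXT;
        return $self;
    }

    sub encode_utf8 {
        my ($self) = @_;
        die "double encode!\n" if $self->{type} == BINARY;
        $self->{str}  = Encode::encode('UTF-8', $self->{str});
        $self->{type} = BINARY;
        return $self;
    }

    # Analogue of text::set(): assert a state without transcoding.
    sub set {
        my ($self, $type) = @_;
        $self->{type} = $type;
        return $self;
    }
}

my $s = TrackedString->new("caf\xc3\xa9");   # raw UTF-8 bytes, state UNKNOWN
$s->decode_utf8;                             # ok: now TEXT
eval { $s->decode_utf8 };                    # second decode refused
print $@;                                    # prints "double decode!"
```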

This would allow Perl to encode “known text” strings for the OS. So Perl on Windows could use the Unicode APIs, for example, and applications could find out right away when they have double-decode or double-encode errors.
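For reference, this is what a silent double-encode does today — the mojibake that the proposed die() would catch at its source:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode ();

my $text = "caf\x{e9}";                       # decoded text: 'café'
my $once = Encode::encode('UTF-8', $text);    # "caf\xc3\xa9" — correct bytes

# Nothing stops us from encoding the already-encoded bytes again; each
# byte is treated as a character and re-encoded, yielding mojibake:
my $twice = Encode::encode('UTF-8', $once);   # "caf\xc3\x83\xc2\xa9"

printf "once:  %vX\n", $once;                 # prints "once:  63.61.66.C3.A9"
printf "twice: %vX\n", $twice;                # prints "twice: 63.61.66.C3.83.C2.A9"
```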

I’m sure there are problems with it, but … thoughts?
