perl.perl5.porters | Postings from April 2021

Re: Perl 7: Fix string leaks?

Felipe Gasper
April 1, 2021 12:18

> On Apr 1, 2021, at 7:56 AM, Ben Bullock <> wrote:
> On Thu, 1 Apr 2021 at 20:15, Felipe Gasper <> wrote:
>> Also, this isn’t quite true. It’s entirely possible for Perl or
>> some XS module to upgrade a string that contains a PNG.
> It's possible for an XS module to do that but it's not relevant, since
> the content is not text data, and it is still possible to do that no
> matter how the bytes are obtained, using SvPVbyte or whatever:
> XS code (sorry but gmail hates tabs so the indentation is missing):
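> SV *
> disaster (gd)
> SV * gd
> PREINIT:
> char * c;
> STRLEN l;
> CODE:
> /* fetch the octets, copy them, then wrongly flag the copy as UTF-8 */
> c = SvPVbyte (gd, l);
> RETVAL = newSVpvn (c, l);
> SvUTF8_on (RETVAL);
> OUTPUT:
> RETVAL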
> Script to run it:
> no utf8;
> my $input = 'かきくけこ';
> my $output = g::disaster ($input);
> binmode STDOUT, ":encoding(utf8)";
> print "$output\n";
> I'm not sure under what circumstances Perl would arbitrarily
> up/downgrade a binary string. Do you have a code example which doesn't
> involve directly calling utf8::downgrade/upgrade? It doesn't seem
> possible to me unless the user does some kind of broken string
> manipulation on the data.

Appending a >255 code point to a string will always upgrade its storage. encode('Latin-1') seems to upgrade (which I think is weird).
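A minimal sketch of the first point, assuming a stock Perl 5 with no `use utf8` in effect. The variable names are only illustrative. It appends a code point above 255 to a string of octets, removes it again, and then checks the internal storage with `utf8::is_utf8` and `bytes::length`:

```perl
use strict;
use warnings;
use bytes ();    # loads bytes::length without enabling byte semantics

# A "binary" string: four octets, stored downgraded (UTF8 flag off).
my $png = "\x89PNG";
my $was_upgraded = utf8::is_utf8($png) ? 1 : 0;

# Appending a code point above 255 forces Perl to upgrade the storage.
my $mixed = $png . "\x{263A}";
chop $mixed;    # remove the character again; the storage stays upgraded

my $is_upgraded   = utf8::is_utf8($mixed) ? 1 : 0;
my $same_content  = $mixed eq $png ? 1 : 0;
my $stored_octets = bytes::length($mixed);    # \x89 now occupies two octets

print "before: upgraded=$was_upgraded\n";
print "after:  upgraded=$is_upgraded same_content=$same_content ",
      "stored_octets=$stored_octets\n";
```

The two strings still compare equal at the Perl level, so nothing looks wrong until some code (typically XS) reads the internal buffer without checking the flag.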

I can’t think of any scenario where Perl itself would downgrade a string, though XS modules do all manner of funny business in that regard. The bigger point, though, is that because the behaviour here is unspecified, the burden of proof logically lies the other way: one should demonstrate that there are *no* places where Perl downgrades a string. Failing that, we must assume that Perl may, at any time and for any reason, downgrade a bytes-compatible string.

>> If Perl could distinguish binary from text we could prevent
>> that. (See my proposal earlier in this thread.)
> I only subscribed to this mailing list a short while ago and the web
> server for the mailing list is out of action at the moment. Was this
> your proposal to add more bit flags? I really thought that was a very
> good idea for dealing with the utf8_downgrade problem, but I can't
> find the original post now.

Yes, that was it. Re-pasted for convenience:

It’s my understanding that there are unused bits in the SV. What if we used two of those to store an enum that records the decoded/encoded state, thus:

enum sv_string_type {
   SV_STRING_TYPE_UNKNOWN,  /* default: not yet known */
   SV_STRING_TYPE_TEXT,     /* decoded */
   SV_STRING_TYPE_BINARY,   /* encoded */
   /* fourth value unused */
};

… then some new core mechanism were aware of that enum and die()d if an attempt to double-encode or double-decode happened. So you’d have:

my $str = <STDIN>;    # SV_STRING_TYPE_UNKNOWN by default, configurable.

text::decode_utf8($str);   # sets SV_STRING_TYPE_TEXT

text::decode_utf8($str);   # oops! die()s

text::encode_utf8($str);   # sets SV_STRING_TYPE_BINARY

text::encode_utf8($str);   # oops! die()s

# Existing code, of course, decodes using something like:

# $str is still SV_STRING_TYPE_BINARY, so new, text-aware Perl would need to
# set SV_STRING_TYPE_TEXT without actually decoding:

# Likewise with encode operations:

Obviously there’d be a lot of text::set()’ing going on for a while, but a) it would all be optional, and b) applications that already exercise proper “Eternal Vigilance” are well-positioned for this already. Applications that mishandle it already would, of course, have a meaningless sv_string_type -- which is no less than what they have now.
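To make the intended behaviour concrete, here is a rough pure-Perl sketch that emulates the idea with a wrapper object instead of SV bits. Everything in it is an assumption for illustration: `TaggedString`, its methods, and `set_type` (a stand-in for the `text::set()` mentioned above) are hypothetical names, and the real proposal would live in core rather than in a class. It needs Perl 5.14 or later for the `package NAME { ... }` syntax.

```perl
use strict;
use warnings;

package TaggedString {    # hypothetical wrapper, not a real module
    use constant { UNKNOWN => 0, TEXT => 1, BINARY => 2 };

    sub new {
        my ($class, $value, $type) = @_;
        return bless { value => $value, type => $type // UNKNOWN }, $class;
    }

    sub value { $_[0]{value} }
    sub type  { $_[0]{type} }

    # Stand-in for the proposed text::set(): relabel without transcoding.
    sub set_type {
        my ($self, $type) = @_;
        $self->{type} = $type;
        return $self;
    }

    sub decode_utf8 {
        my ($self) = @_;
        die "double decode\n" if $self->{type} == TEXT;
        utf8::decode($self->{value}) or die "invalid UTF-8\n";
        $self->{type} = TEXT;
        return $self;
    }

    sub encode_utf8 {
        my ($self) = @_;
        die "double encode\n" if $self->{type} == BINARY;
        utf8::encode($self->{value});
        $self->{type} = BINARY;
        return $self;
    }
}

# "café" as UTF-8 octets; the type starts out as UNKNOWN.
my $str = TaggedString->new("caf\xc3\xa9");
$str->decode_utf8;                                     # now TEXT

# A second decode is refused instead of silently corrupting the data.
my $second_ok = eval { $str->decode_utf8; 1 } ? 1 : 0;
my $error     = $@;
print "second decode ", ($second_ok ? "succeeded" : "died: $error");
```

Code that decodes by some other route would call `set_type` to record the state without transcoding, which is the "lot of text::set()’ing" cost described above.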

This would allow Perl to encode “known text” strings for the OS. So Perl on Windows could use the Unicode APIs, for example, and applications could know right away if they have double-decode or double-encode errors.
