perl.perl5.porters | Postings from April 2021

Re: Perl 7: Fix string leaks?

From: Ben Bullock
Date: April 2, 2021 00:50
Subject: Re: Perl 7: Fix string leaks?
Message ID: CAN5Y6m-2Agj7WFo73i1jETfnJPGfC5dYQu3HKQevBpTvWK+zCw@mail.gmail.com

On Thu, 1 Apr 2021 at 21:18, Felipe Gasper <felipe@felipegasper.com> wrote:

> Appending a >255 code point to a string will always upgrade its
> storage. encode('Latin-1') seems to upgrade (which I think is
> weird).

Perl seems to have been set up the way it is to deal with the
following sort of ambiguity:

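use strict;
use warnings;
use utf8;
use File::Slurper qw!read_binary write_binary!;
# Fetch a sample PNG; wget's log output is discarded.
`wget -o /dev/null -O qr.png https://www.qrpng.org/qrpng.cgi`;
my $png = read_binary ('qr.png');
# Appending a code point above 255 upgrades the string's storage.
$png .= '🥦';
# This file comes out mangled.
write_binary ('qr-broccolli.png', $png);
# Remove the emoji again; /s and \z make sure the final character matches.
$png =~ s/.\z//s;
# This file is a valid PNG again.
write_binary ('qr-no-broccolli.png', $png);
print `file *.png`;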

The surprising part is that qr-no-broccolli.png actually gets written
correctly after qr-broccolli.png is mangled. But Perl is making an
assumption about arbitrary data which happens to work in the above
case.
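
The same effect can be reproduced without the network or any files. The following is only a sketch: it stands in for the PNG with the eight-byte PNG signature, and it models what a raw filehandle writes for an upgraded string by calling utf8::encode on a copy. The variable names are made up for the illustration.

```perl
use strict;
use warnings;

# The eight-byte PNG signature; \x89 is its only byte above 0x7F.
my $orig = "\x89PNG\x0d\x0a\x1a\x0a";
my $png  = $orig;

# U+1F966 BROCCOLI is above 255, so appending it upgrades the storage.
$png .= "\x{1F966}";

# Model of the bytes a raw filehandle would write for the upgraded string.
my $written = $png;
utf8::encode($written);

printf "upgraded: %s\n", utf8::is_utf8($png) ? "yes" : "no";
printf "bytes written: %d (original had %d)\n",
    length($written), length($orig);
printf "first two bytes written: %s\n", unpack("H*", substr($written, 0, 2));

# Remove the emoji again: the code points now match the original bytes.
$png =~ s/.\z//s;
printf "same as original: %s\n", $png eq $orig ? "yes" : "no";

# A copy can be downgraded back to plain bytes without loss.
my $copy = $png;
printf "downgradable: %s\n", utf8::downgrade($copy, 1) ? "yes" : "no";
```

The byte \x89 comes out as the two bytes c2 89 in the mangled version, which is why the first file is no longer a PNG, while the second string compares equal to the original once the emoji is gone.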

Am I correct in thinking that your "this is binary" marking would
actually stop the broccoli from being added here and remove the
ambiguity?

> I can’t think of any scenario where Perl itself would downgrade a
> string

I can't either. I went through the source code of Perl trying to find
a place where it does that but couldn't find one.

> though XS modules do all manner of funny business in that
> regard. The bigger point, though, is that because the behaviour here
> is unspecified, the burden of proof logically lies the other way:
> one should demonstrate that there are *no* places where perl
> downgrades a string, in default of which demonstration we must
> assume that Perl may, at any time and for any reason, downgrade a
> bytes-compatible string.

That seems to make a lot of work.

> >
> >> If Perl could distinguish binary from text we could prevent
> >> that. (See my proposal earlier in this thread.)
> >
> > I only subscribed to this mailing list a short while ago and the web
> > server for the mailing list is out of action at the moment, was this
> > your proposal to add more bit flags? I really thought that was a very
> > good idea, in terms of solving the problem of dealing with the
> > utf8_downgrade problem, but I can't find the original post now.
>
> Yes, that was it. Re-pasted for convenience:

Thank you.

>
> -----
> It’s my understanding that there are unused bits in the SV. What if we used two of those to store an enum that records the decoded/encoded state, thus:
>
> enum sv_string_type {
>    SV_STRING_TYPE_UNKNOWN,
>    SV_STRING_TYPE_TEXT,     /* decoded */
>    SV_STRING_TYPE_BINARY,   /* encoded */
>    /* unused */
> };

OK, but why do you need two bits? Let's say you just have a "binary"
bit which stops the string from being upgraded or downgraded. So in my
PNG example, when I read the file and it has bytes >= 128, it's marked
as binary, and Perl then stops or warns at the .= '🥦' stage rather
than at the stage of writing the file.
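
As a rough illustration of failing at the append stage, here is a pure-Perl emulation. It is not how a core flag would work: BinaryString is a hypothetical wrapper class invented for this sketch, and it uses operator overloading to refuse any concatenation that would bring in a code point above 255.

```perl
use strict;
use warnings;

# Hypothetical wrapper that stands in for a core "binary" bit.
package BinaryString {
    use Carp qw(croak);
    use overload
        '""'     => sub { ${ $_[0] } },   # stringify to the raw bytes
        '.'      => \&concat,             # also used for .=
        fallback => 1;

    sub new {
        my ($class, $bytes) = @_;
        croak "not a byte string" if $bytes =~ /[^\x00-\xFF]/;
        return bless \$bytes, $class;
    }

    sub concat {
        my ($self, $other, $swapped) = @_;
        $other = "$other";
        croak "refusing to mix binary data with wide characters"
            if $other =~ /[^\x00-\xFF]/;
        my $joined = $swapped ? $other . $$self : $$self . $other;
        return ref($self)->new($joined);
    }
}

my $png = BinaryString->new("\x89PNG\x0d\x0a\x1a\x0a");
$png .= "IHDR";                            # plain bytes are accepted

my $ok = eval { $png .= "\x{1F966}"; 1 };  # the broccoli is rejected
print $ok ? "appended\n" : "rejected: $@";
```

The error is raised where the emoji is appended, and $png still holds the twelve bytes it had before the failed append.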

>
> … then some new core mechanism were aware of that enum and die()d if an attempt to double-encode or double-decode happened. So you’d have:
>
> my $str = <STDIN>;    # SV_STRING_TYPE_UNKNOWN by default, configurable.
>
> text::decode_utf8($str);   # sets SV_STRING_TYPE_TEXT
>
> text::decode_utf8($str);   # oops! die()s
>
> text::encode_utf8($str);   # sets SV_STRING_TYPE_BINARY
>
> text::encode_utf8($str);   # oops! die()s

Double decoding and double encoding are an artefact of Perl trying to
guess whether you have binary data or not. In the broccoli example,
the second time it writes the PNG it correctly guesses the encoding,
but when it allows you to add on the emoji, it incorrectly guesses the
data to be "LATIN-1" or whatever it thinks it is.

I think this can all be prevented with a single bit, since you already
have the SvUTF8 flag. If the SvUTF8 flag is the first bit here and "is
binary" is the second then:

00 - unknown text
01 - binary, not text, don't upgrade
10 - UTF-8 text
11 - impossible state

00 -> 10 upgrade OK
10 -> upgrade fails because "already upgraded"
01 -> upgrade fails because binary
11 -> impossible state

00 -> downgrade fails because "already downgraded"
10 -> 00 downgrade OK
01 -> downgrade fails because binary
11 -> impossible state
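
The transitions listed above can be written out as a small state machine. This is only a model of the table, not of anything in the Perl core: the constants ST_UNKNOWN, ST_BINARY and ST_UTF8 and the functions upgrade and downgrade are names made up for the sketch.

```perl
use strict;
use warnings;

# Hypothetical two-bit state: first bit is SvUTF8, second is "is binary".
use constant {
    ST_UNKNOWN => '00',
    ST_BINARY  => '01',
    ST_UTF8    => '10',
};

# Each function returns the new state or dies with the table's reason.
sub upgrade {
    my ($state) = @_;
    return ST_UTF8 if $state eq ST_UNKNOWN;
    die "already upgraded\n" if $state eq ST_UTF8;
    die "binary\n"           if $state eq ST_BINARY;
    die "impossible state\n";
}

sub downgrade {
    my ($state) = @_;
    return ST_UNKNOWN if $state eq ST_UTF8;
    die "already downgraded\n" if $state eq ST_UNKNOWN;
    die "binary\n"             if $state eq ST_BINARY;
    die "impossible state\n";
}

my %op = (upgrade => \&upgrade, downgrade => \&downgrade);

# Walk every state through both operations and report the outcome.
for my $name (qw(upgrade downgrade)) {
    for my $state (ST_UNKNOWN, ST_UTF8, ST_BINARY, '11') {
        my $new = eval { $op{$name}->($state) };
        print defined $new
            ? "$state -> $new $name OK\n"
            : "$state -> $name fails: $@";
    }
}
```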

The "binary" bit would be set not only for actual binary data but also
for other things, for example an encoding scheme like EUC or CP932
where the bytes represent text. So the name "binary" is slightly
misleading; it could be called, for example, the "noupdowngrade" bit.

> Obviously there’d be a lot of text::set()’ing going on for a
> while, but a) it would all be optional, and b) applications that
> already exercise proper “Eternal Vigilance” are well-positioned
> for this already. Applications that mishandle it already would, of
> course, have a meaningless sv_string_type -- which is no less than
> what they have now.

Perl, without information about what the data actually consists of,
makes guesses, like we saw in the broccoli example above, and the
guesses sometimes turn out to be wrong. What I would suggest is that
Perl be made to stop guessing, by just having a way to say "this data
cannot be upgraded or downgraded". All that is needed is one "binary"
bit which prevents it from happening.
