Re: Perl 7: Fix string leaks?

From: Dan Book
Date: April 2, 2021 00:58
Subject: Re: Perl 7: Fix string leaks?
Message ID: CABMkAVUEMs1s-Oehc4Hc_JrsH9tThW__Gw06_tiLhDeXPL4U0g@mail.gmail.com

On Thu, Apr 1, 2021 at 8:51 PM Ben Bullock <benkasminbullock@gmail.com>
wrote:

> On Thu, 1 Apr 2021 at 21:18, Felipe Gasper <felipe@felipegasper.com>
> wrote:
>
> > Appending a >255 code point to a string will always upgrade its
> > storage. encode('Latin-1') seems to upgrade (which I think is
> > weird).
>
> Perl seems to have been set up the way it is to deal with the
> following sort of ambiguity:
>
> use utf8;
> use File::Slurper qw!read_binary write_binary!;
> `wget -o /dev/null -O qr.png https://www.qrpng.org/qrpng.cgi`;
> my $png = read_binary ('qr.png');
> $png .= '🥦';
> write_binary ('qr-broccolli.png', $png);
> $png =~ s/.$//;
> write_binary ('qr-no-broccolli.png', $png);
> print `file *.png`;
>
> The surprising part is that qr-no-broccolli.png actually gets written
> correctly after qr-broccolli.png is mangled. But Perl is making an
> assumption about arbitrary data which happens to work in the above
> case.
>
> Am I correct in thinking that your "this is binary" data would
> actually stop the "broccolli" from being added here and remove the
> ambiguity?
>
> > I can’t think of any scenario where Perl itself would downgrade a
> > string
>
> I can't either. I went through the source code of Perl trying to find
> a place where it does that but couldn't find one.
>
> > though XS modules do all manner of funny business in that
> > regard. The bigger point, though, is that because the behaviour here
> > is unspecified, the burden of proof logically lies the other way:
> > one should demonstrate that there are *no* places where perl
> > downgrades a string, in default of which demonstration we must
> > assume that Perl may, at any time and for any reason, downgrade a
> > bytes-compatible string.
>
> It seems to make a lot of work.
>
> > >
> > >> If Perl could distinguish binary from text we could prevent
> > >> that. (See my proposal earlier in this thread.)
> > >
> > > I only subscribed to this mailing list a short while ago and the web
> > > server for the mailing list is out of action at the moment, was this
> > > your proposal to add more bit flags? I really thought that was a very
> > > good idea, in terms of solving the problem of dealing with the
> > > utf8_downgrade problem, but I can't find the original post now.
> >
> > Yes, that was it. Re-pasted for convenience:
>
> Thank you.
>
> >
> > -----
> > It’s my understanding that there are unused bits in the SV. What if we
> > used two of those to store an enum that records the decoded/encoded
> > state, thus:
> >
> > enum sv_string_type {
> >    SV_STRING_TYPE_UNKNOWN,
> >    SV_STRING_TYPE_TEXT,     /* decoded */
> >    SV_STRING_TYPE_BINARY,   /* encoded */
> >    /* unused */
> > }
>
> OK but why do you need two bits? Let's say you just have a "binary"
> bit which stops it from being altered. So in my PNG example, when I
> read the file and it has bytes >= 128 then it's marked as binary, then
> it stops/warns at the .= '🥦' stage rather than at the stage of
> writing the file.
>
> >
> > … then some new core mechanism were aware of that enum and die()d if
> > an attempt to double-encode or double-decode happened. So you’d have:
> >
> > my $str = <STDIN>;    # SV_STRING_TYPE_UNKNOWN by default, configurable.
> >
> > text::decode_utf8($str);   # sets SV_STRING_TYPE_TEXT
> >
> > text::decode_utf8($str);   # oops! die()s
> >
> > text::encode_utf8($str);   # sets SV_STRING_TYPE_BINARY
> >
> > text::encode_utf8($str);   # oops! die()s
>
> Double decoding and double encoding is an artefact of Perl trying to
> guess whether you have binary data or not. In the broccolli example,
> the second time it writes the PNG it correctly guesses the encoding,
> but when it allows you to add on the emoji, it incorrectly guesses it
> to be "LATIN-1" or whatever it thinks it is.
>
> I think this can all be prevented with a single bit, since you already
> have the SvUTF8 flag. If SvUTF8 flag is the first bit here and "is
> binary" is the second then:
>
> 00 - unknown text
> 01 - binary, not text, don't upgrade
> 10 - UTF-8 text
> 11 - impossible state
>
> 00 -> 10 upgrade OK
> 10 -> upgrade fails because "already upgraded"
> 01 -> upgrade fails because binary
> 11 -> impossible state
>
> 00 -> downgrade fails because "already downgraded"
> 10 -> 00 downgrade OK
> 01 -> downgrade fails because binary
> 11 -> impossible state
>
> The "binary" bit would be set not only for actual binary data but
> other things, for example an encoding scheme like EUC or CP932 where
> the bytes represent text, so the name "binary" is slightly misleading,
> it could be, for example, the "noupdowngrade" bit.
>
> > Obviously there’d be a lot of text::set()’ing going on for a
> > while, but a) it would all be optional, and b) applications that
> > already exercise proper “Eternal Vigilance” are well-positioned
> > for this already. Applications that mishandle it already would, of
> > course, have a meaningless sv_string_type -- which is no less than
> > what they have now.
>
> Perl, without information about what the data actually consists of,
> makes guesses, like we saw in the broccolli example above, and the
> guesses sometimes turn out to be wrong. What I would suggest is that
> Perl is made to stop making the guesses, by just having a way to say
> "this data cannot be up/down graded". All that needs is one "binary"
> bit which prevents it from happening.
>

There is some misunderstanding here. The UTF8 bit is not a guess about
what the string contains, so it cannot be combined with an explicitly set
bit for this purpose. It records which format the internal bytes are
*definitely* stored in. Perl is allowed to change that format whenever
needed, and the user cannot depend on it.
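
A minimal sketch of that point, using the utf8::upgrade, utf8::downgrade
and utf8::is_utf8 functions that core Perl provides (the variable names and
the sample string are only illustrative). It changes the storage format of
a copy in both directions and checks that the string value never changes:

```perl
use strict;
use warnings;

# U+00E9 fits in one byte, so this literal starts out in downgraded storage.
my $plain = "caf\x{e9}";
my $copy  = $plain;

# Change only the internal storage format of the copy.
utf8::upgrade($copy);
my $flag_after_upgrade = utf8::is_utf8($copy) ? 1 : 0;
my $equal_upgraded     = $copy eq $plain ? 1 : 0;
my $length_upgraded    = length $copy;

# Change it back; the value is still untouched.
utf8::downgrade($copy);
my $flag_after_downgrade = utf8::is_utf8($copy) ? 1 : 0;
my $equal_downgraded     = $copy eq $plain ? 1 : 0;

print "upgraded:   flag=$flag_after_upgrade equal=$equal_upgraded length=$length_upgraded\n";
print "downgraded: flag=$flag_after_downgrade equal=$equal_downgraded\n";
```

The flag flips while eq and length report the same string throughout, which
is why the flag says nothing about whether the data is "text" or "binary".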

When you append an emoji to your bytestring, it forces the string into
upgraded format, but the rest of the string still holds the same byte
values it did before, even though they are now stored differently. You can
verify this by comparing the original string with a substring of the
modified string. Thus when you remove the emoji codepoint, it is still the
same string, regardless of the change in storage format.
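
That check can be sketched roughly as follows, assuming a few literal bytes
as a stand-in for the PNG data read from disk (the variable names are
illustrative, and the emoji is written as an escape so the snippet does not
depend on "use utf8"):

```perl
use strict;
use warnings;

# Stand-in for binary file contents: every character is a byte value 0..255.
my $png  = join '', map { chr } 0x89, 0x50, 0x4E, 0x47, 0xFF;
my $orig = $png;

my $flag_before = utf8::is_utf8($png) ? 1 : 0;

# Appending a codepoint above 255 forces the upgraded storage format.
$png .= "\x{1F966}";
my $flag_after = utf8::is_utf8($png) ? 1 : 0;

# The leading part still compares equal to the original byte string.
my $prefix_same = substr($png, 0, length $orig) eq $orig ? 1 : 0;

# Remove the last character again (/s lets . match any character).
$png =~ s/.\z//s;
my $round_trip = $png eq $orig ? 1 : 0;

print "flag before=$flag_before after=$flag_after\n";
print "prefix same=$prefix_same round trip=$round_trip\n";
```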

Calling write_binary with a string that contains a codepoint over 255 is a
logic error, and you would receive a warning upon trying to do this. That
Perl instead dumps whatever is in its internal buffer in this case is an
implementation detail, and it leads to things "accidentally working" just
enough to confuse people.
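
A rough way to observe this without File::Slurper, on the assumption that
write_binary amounts to a print on a handle with no encoding layer. The
temporary file, the sample data and the warning capture are all
illustrative:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# One byte-range character, three ASCII characters, one codepoint above 255.
my $data = "\x{89}PNG" . "\x{1F966}";

my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh, ':raw';    # no encoding layer, as for binary output

# Collect warnings instead of letting them go to STDERR.
my @warnings;
{
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    print {$fh} $data;
}
close $fh or die "close failed: $!";

my $chars_in_string = length $data;
my $bytes_on_disk   = -s $path;

print "warning: $_" for @warnings;
print "characters in string: $chars_in_string, bytes on disk: $bytes_on_disk\n";
```

The file ends up larger than the character count because the upgraded
internal buffer was written as-is, including the re-encoded \x{89}.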

-Dan
