
Re: Perl 7: Fix string leaks?

From: Ben Bullock
Date: April 2, 2021 02:39
Subject: Re: Perl 7: Fix string leaks?
Message ID: CAN5Y6m8QTTztzxaL0gqZ6_GbJEoPPUzfj1KCYqJzNHuEKJrKBg@mail.gmail.com

On Fri, 2 Apr 2021 at 09:58, Dan Book <grinnz@gmail.com> wrote:

> The UTF8 bit does not constitute any guess, so it cannot be combined
> with an explicitly set bit for this purpose. It indicates which
> format the internal bytes are *definitely* in, which Perl is allowed
> to change whenever needed and the user cannot depend on.

In this case, the UTF8 bit of the composed string is the result of a
guess. Two scalars are concatenated: one has the UTF8 bit set and the
other does not. The output of the concatenation has the UTF8 bit set.
That does indeed constitute a guess as to what is in the scalars. If
the output did not have the UTF8 bit set, that would also constitute a
guess. Either outcome is a guess. Perl does not have enough
information to decide unambiguously whether the composed string is
meant to be UTF8 or not.
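
For anyone who wants to look at the flag directly, here is a minimal
sketch of the concatenation being discussed. The variable names and
the short byte string standing in for the PNG data are my own
assumptions, not taken from the thread; utf8::is_utf8 is the core
function that reports the internal flag.

```perl
use strict;
use warnings;

# A byte string (every character below 256), standing in for binary
# data, and a character string holding one codepoint above 255.
my $bytes = "\x89PNG\r\n";
my $chars = "\x{1F966}";    # U+1F966 BROCCOLI

my $joined = $bytes . $chars;

# Report the internal UTF8 flag and the length in characters.
for my $pair (['bytes', $bytes], ['chars', $chars], ['joined', $joined]) {
    my ($name, $value) = @$pair;
    printf "%-6s flag=%d length=%d\n",
        $name, utf8::is_utf8($value) ? 1 : 0, length $value;
}

# The leading part of the result still compares equal to the original.
my $same_prefix = substr($joined, 0, length $bytes) eq $bytes ? 1 : 0;
print "prefix unchanged\n" if $same_prefix;
```

The flag on $joined is set even though $bytes on its own carries no
flag, which is the outcome both sides of this thread are describing.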

> When you append an emoji to your bytestring, it forces it to be in
> upgraded format, but the rest of the string contains the same bytes
> it did before, even though they are now stored differently. You can
> verify this by comparing the original string with a substring of the
> modified string. Thus when you remove the emoji codepoint, it is
> still the same string, regardless of the change in storage format.

But if Perl guessed the other way, and set the UTF8 flag to zero
instead of one, everything about the string which you've said above
would remain true:

no utf8;    # the emoji literals below are read as four bytes each
use File::Slurper qw!read_binary write_binary!;
# Fetch a sample PNG (needs wget and network access).
`wget -o /dev/null -O qr.png https://www.qrpng.org/qrpng.cgi`;
my $png = read_binary ('qr.png');
my $bpng = $png . '🥦';
# The original bytes survive as a prefix of the composed string.
if (substr ($bpng, 0, length ($png)) eq $png) {
    print "correct.\n";
}
write_binary ('qr-broccolli.png', $bpng);
$bpng =~ s/🥦$//;
write_binary ('qr-no-broccolli.png', $bpng);
print `file *.png`;

> write_binary with a string that contains a codepoint over 255 is a
> logic error, and you would receive a warning upon trying to do this
> - that Perl dumps whatever is in its internal buffer instead in this
> case is an implementation detail, and leads to things "accidentally
> working" just enough to confuse people.

"The curious incident of the dog in the night-time."

The first call to write_binary gives a warning, but the second call to
write_binary does not. Perl is using a heuristic ("guessing"); it
doesn't have any information about the content of $bpng in either
case.
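
The difference between the two writes can be reproduced without
File::Slurper or a network fetch. This is only a sketch:
warnings_on_write is a hypothetical helper of mine, and it prints to
an in-memory handle rather than calling write_binary, so it shows the
behaviour of print on a binary handle rather than of that module.

```perl
use strict;
use warnings;

# Hypothetical helper: print $data to an in-memory binary handle and
# return how many warnings the print triggered.
sub warnings_on_write {
    my ($data) = @_;
    my @caught;
    local $SIG{__WARN__} = sub { push @caught, $_[0] };
    open my $fh, '>:raw', \my $buffer or die "open: $!";
    print {$fh} $data;
    close $fh;
    return scalar @caught;
}

my $bytes  = "\x89PNG\r\n";
my $joined = $bytes . "\x{1F966}";    # codepoint above 255 appended

# Remove the emoji again; the storage format is not switched back.
(my $stripped = $joined) =~ s/\x{1F966}\z//;

my $first  = warnings_on_write($joined);
my $second = warnings_on_write($stripped);

printf "with emoji:    %d warning(s), flag=%d\n",
    $first, utf8::is_utf8($joined) ? 1 : 0;
printf "emoji removed: %d warning(s), flag=%d\n",
    $second, utf8::is_utf8($stripped) ? 1 : 0;
```

Printing the flag next to the warning count shows whether the warning
follows the flag or the codepoints actually present in the string.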

This is why the double-encoding and double-decoding problem occurs.
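
For readers who have not run into it, double-encoding looks like the
following. This is an illustrative sketch using the Encode module and
a one-character string of my own choosing; it is not code from this
thread.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text  = "\x{E9}";                  # one character, U+00E9
my $once  = encode('UTF-8', $text);    # bytes C3 A9
my $twice = encode('UTF-8', $once);    # bytes C3 83 C2 A9

printf "text=%d once=%d twice=%d\n",
    length $text, length $once, length $twice;

# Decoding the double-encoded bytes once gives back the singly encoded
# bytes, not the original character.
my $decoded = decode('UTF-8', $twice);
my $is_once = $decoded eq $once ? 1 : 0;
print "one decode is not enough\n" if $is_once;
```

The second encode succeeds without complaint because nothing in $once
records that it has already been encoded.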
