
Re: Perl 7: Fix string leaks?

From: Dan Book
Date: April 2, 2021 02:46
Subject: Re: Perl 7: Fix string leaks?
Message ID: CABMkAVX0Ab+3Nb4wmF03O8yR9-pB8h6b9s54rDUPGjsQxZpRgQ@mail.gmail.com

On Thu, Apr 1, 2021 at 10:39 PM Ben Bullock <benkasminbullock@gmail.com>
wrote:

> On Fri, 2 Apr 2021 at 09:58, Dan Book <grinnz@gmail.com> wrote:
>
> > The UTF8 bit does not constitute any guess, so it cannot be combined
> > with an explicitly set bit for this purpose. It indicates which
> > format the internal bytes are *definitely* in, which Perl is allowed
> > to change whenever needed and the user cannot depend on.
>
> In this case, the UTF8 bit of the composed string is the result of a
> guess. There are two scalars concatenated together, one has the UTF8
> bit set, and one does not have it set. The output of the
> concatenation has the UTF8 bit set. That does indeed constitute a
> guess as to what is in the scalars. If the output did not have the
> UTF8 bit set, that would also constitute a guess. Either outcome is a
> guess. Perl does not have enough information to decide unambiguously
> if the composed string is meant to be UTF8 or not.
>

It doesn't need to know. The UTF8 bit indicates how the codepoints are
stored and makes no judgment on the string's content.
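
As a minimal sketch of that point (the sample string and variable names
here are illustrative assumptions, not taken from the thread), the same
string can be held in either internal format and still compare equal:

```perl
use strict;
use warnings;

# A byte string with one non-ASCII byte (0xE9), stored in downgraded form.
my $bytes = "caf\xE9";
my $copy  = $bytes;

# Force the copy into the upgraded (UTF8-flagged) internal format.
# The utf8:: functions are always available without loading utf8.pm.
utf8::upgrade($copy);

printf "original flag: %d\n", utf8::is_utf8($bytes) ? 1 : 0;   # 0
printf "upgraded flag: %d\n", utf8::is_utf8($copy)  ? 1 : 0;   # 1

# Same codepoints, same length, equal strings; only the storage differs.
printf "lengths: %d and %d\n", length($bytes), length($copy);  # 4 and 4
printf "equal: %s\n", $bytes eq $copy ? "yes" : "no";          # yes
```

The flag differs between the two scalars, but nothing observable at the
string level does.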


>
> > When you append an emoji to your bytestring, it forces it to be in
> > upgraded format, but the rest of the string contains the same bytes
> > it did before, even though they are now stored differently. You can
> > verify this by comparing the original string with a substring of the
> > modified string. Thus when you remove the emoji codepoint, it is
> > still the same string, regardless of the change in storage format.
>
> But if Perl guessed the other way, and set the UTF8 flag to zero
> instead of one, everything about the string which you've said above
> would remain true:
>
> no utf8;
> use File::Slurper qw!read_binary write_binary!;
> `wget -o /dev/null -O qr.png https://www.qrpng.org/qrpng.cgi`;
> my $png = read_binary ('qr.png');
> my $bpng = $png . '🥦';
> if (substr ($bpng, 0, length ($png)) eq $png) {
>     print "correct.\n";
> }
> write_binary ('qr-broccolli.png', $bpng);
> $bpng =~ s/🥦$//;
> write_binary ('qr-no-broccolli.png', $bpng);
> print `file *.png`;
>

This is not a different guess, but a different string, because you've
appended the UTF-8 bytes representing the emoji instead of a character.
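
A rough illustration of the difference, using escapes so that it does
not depend on how the source file is encoded (the short stand-in for the
PNG data and the variable names are assumptions made for this sketch):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for binary file content; the real script reads a PNG file.
my $png = "\x89PNG\r\n";

my $char   = "\x{1F966}";             # one character (the broccoli emoji)
my $octets = encode('UTF-8', $char);  # its four UTF-8 bytes, which is
                                      # what the literal is under "no utf8"

my $with_char  = $png . $char;
my $with_bytes = $png . $octets;

printf "character appended: length %d, UTF8 flag %d\n",
    length($with_char), utf8::is_utf8($with_char) ? 1 : 0;    # 7, 1
printf "bytes appended:     length %d, UTF8 flag %d\n",
    length($with_bytes), utf8::is_utf8($with_bytes) ? 1 : 0;  # 10, 0

# These are different strings, not two storage choices for one string.
print "same string? ", ($with_char eq $with_bytes ? "yes" : "no"), "\n";  # no
```

Appending four bytes below 256 never requires the upgraded format, so
the flag stays off; appending one codepoint above 255 does require it.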


>
> > write_binary with a string that contains a codepoint over 255 is a
> > logic error, and you would receive a warning upon trying to do this
> > - that Perl dumps whatever is in its internal buffer instead in this
> > case is an implementation detail, and leads to things "accidentally
> > working" just enough to confuse people.
>
> "The curious incident of the dog in the night-time."
>
> The first call to write_binary gives a warning, but the second call to
> write_binary does not. Perl is using a heuristic ("guessing"), it
> doesn't have any information about the content of $bpng in either
> case.
>
> This is why the double-encoding and double-decoding problem occurs.
>

No, it has exactly the information you've given it: a string which contains
a codepoint too large to write to a raw bytestream.
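
One way to see this without writing any files is to ask whether the
string can be represented as bytes at all. This is only a sketch:
can_be_bytes is a hypothetical helper name, and the sample data is an
assumption standing in for the PNG content.

```perl
use strict;
use warnings;

my $data = "\x89PNG";              # stand-in for binary content
my $wide = $data . "\x{1F966}";    # now contains a codepoint above 255

# Hypothetical helper: true if every codepoint fits in a byte.
# utf8::downgrade with a true second argument returns false instead of
# dying when the string cannot be downgraded. It works on a copy here.
sub can_be_bytes {
    my ($string) = @_;
    return utf8::downgrade($string, 1) ? 1 : 0;
}

# Remove the emoji again, as the second write in the quoted script does.
(my $stripped = $wide) =~ s/\x{1F966}\z//;

printf "with emoji:    writable as bytes? %d\n", can_be_bytes($wide);      # 0
printf "emoji removed: writable as bytes? %d\n", can_be_bytes($stripped);  # 1
printf "emoji removed equals original? %s\n",
    $stripped eq $data ? "yes" : "no";                                     # yes
```

The answer depends only on the codepoints in the string, which is why
one write warns and the other does not.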

-Dan
