perl.perl5.porters | Postings from April 2021

Re: Perl 7: Fix string leaks?

Dan Book
April 2, 2021 06:56
On Thu, Apr 1, 2021, 10:46 PM Dan Book wrote:

> On Thu, Apr 1, 2021 at 10:39 PM Ben Bullock wrote:
>> On Fri, 2 Apr 2021 at 09:58, Dan Book wrote:
>> > The UTF8 bit does not constitute any guess, so it cannot be combined
>> > with an explicitly set bit for this purpose. It indicates which
>> > format the internal bytes are *definitely* in, which Perl is allowed
>> > to change whenever needed and the user cannot depend on.
>> In this case, the UTF8 bit of the composed string is the result of a
>> guess. There are two scalars concatenated together, one has the UTF8
>> bit set, and one does not have it set. The output of the
>> concatenation has the UTF8 bit set. That does indeed constitute a
>> guess as to what is in the scalars. If the output did not have the
>> UTF8 bit set, that would also constitute a guess. Either outcome is a
>> guess. Perl does not have enough information to decide unambiguously
>> if the composed string is meant to be UTF8 or not.
> It doesn't need to know. The UTF8 bit indicates how the codepoints are
> stored and makes no judgment on its content.
>> > When you append an emoji to your bytestring, it forces it to be in
>> > upgraded format, but the rest of the string contains the same bytes
>> > it did before, even though they are now stored differently. You can
>> > verify this by comparing the original string with a substring of the
>> > modified string. Thus when you remove the emoji codepoint, it is
>> > still the same string, regardless of the change in storage format.
>> But if Perl guessed the other way, and set the UTF8 flag to zero
>> instead of one, everything about the string which you've said above
>> would remain true:
>> no utf8;
>> use File::Slurper qw!read_binary write_binary!;
>> `wget -o /dev/null -O qr.png`
>> <>;
>> my $png = read_binary ('qr.png');
>> my $bpng = $png . '🥦';
>> if (substr ($bpng, 0, length ($png)) eq $png) {
>>     print "correct.\n";
>> }
>> write_binary ('qr-broccolli.png', $bpng);
>> $bpng =~ s/🥦$//;
>> write_binary ('qr-no-broccolli.png', $bpng);
>> print `file *.png`;
> This is not a different guess, but a different string, because you've
> appended the UTF-8 bytes representing the emoji instead of a character.
>> > write_binary with a string that contains a codepoint over 255 is a
>> > logic error, and you would receive a warning upon trying to do this
>> > - that Perl dumps whatever is in its internal buffer instead in this
>> > case is an implementation detail, and leads to things "accidentally
>> > working" just enough to confuse people.
>> "The curious incident of the dog in the night-time."
>> The first call to write_binary gives a warning, but the second call to
>> write_binary does not. Perl is using a heuristic ("guessing"), it
>> doesn't have any information about the content of $bpng in either
>> case.
>> This is why the double-encoding and double-decoding problem occurs.
> No, it has exactly the information you've given it: a string which
> contains a codepoint too large to write to a raw bytestream.

For posterity, I am not trying to win an argument or have a discussion
about this; this is literally the design of the Perl string model. The UTF8
bit does not and cannot indicate the intent of the string (this must be
decided by whatever receives the string as an argument), and separately, it
always reliably indicates which of two forms the string is internally
stored in at that time.
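
A minimal sketch of the second point, with illustrative variable names and
using only the core utf8:: functions (they are available without loading the
utf8 pragma): upgrading a string changes the flag and the storage, while the
string itself compares and measures exactly as before.

```perl
use strict;
use warnings;

# Two copies of the same four-character string ("\xE9" is codepoint 233).
my $native   = "caf\xE9";
my $upgraded = "caf\xE9";

# utf8::upgrade changes only the internal storage, not the logical string.
utf8::upgrade($upgraded);

# The flag reports the storage form of each copy.
printf "native flag:   %d\n", utf8::is_utf8($native)   ? 1 : 0;
printf "upgraded flag: %d\n", utf8::is_utf8($upgraded) ? 1 : 0;

# String operations see the same string either way.
printf "equal:         %s\n", $native eq $upgraded ? "yes" : "no";
printf "lengths:       %d %d\n", length($native), length($upgraded);
```

Nothing here tells Perl whether "caf\xE9" is meant as bytes or as text; that
is up to whatever consumes it.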

These two components have no relation except where physically necessary
(e.g. a character string with codepoints larger than 255 can only be stored
upgraded). And since the internal format uses a UTF-8-compatible encoding
and encoded text strings are often expected to be in UTF-8, these two
concerns are often logically conflated.
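
To illustrate the "physically necessary" case, here is a rough sketch along
the lines of the example earlier in the thread. It uses the PNG signature
bytes as a stand-in for a real file, and chr() rather than a literal emoji so
the result does not depend on the source encoding; the variable names are
made up for the example.

```perl
use strict;
use warnings;

# A byte string: the eight-byte PNG file signature.
my $bytes = "\x89PNG\r\n\x1A\n";

# Append one character above 255 (U+1F966); this forces upgraded storage.
my $mixed = $bytes . chr(0x1F966);
print "flag after append: ", (utf8::is_utf8($mixed) ? 1 : 0), "\n";

# The leading part is still the same string as the original bytes.
my $prefix_ok = substr($mixed, 0, length $bytes) eq $bytes;
print "prefix equal:      ", ($prefix_ok ? "yes" : "no"), "\n";

# Remove the wide character again. The storage format is Perl's business;
# the value is once more equal to the original byte string.
chop $mixed;
print "equal after chop:  ", ($mixed eq $bytes ? "yes" : "no"), "\n";
```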

For example, an upgraded byte string is perfectly legal and normal in
Perl's string model, and every correct usage of that string will interpret
it as the original byte string and not the internally stored one, which
would appear double encoded if you looked at the internals without context.
Double encoding in output only occurs from this when the model is used
incorrectly: through a bug in Perl (such as filename handling), XS code, or
other code reaching into internals, as discussed in this thread.
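
As a hedged sketch of the difference (the names are illustrative, and
utf8::encode on the byte string is used here only as a stand-in that yields
the same bytes as reading the upgraded buffer directly):

```perl
use strict;
use warnings;

# UTF-8 encoded bytes for U+00E9, as they might arrive from a file or socket.
my $encoded = "\xC3\xA9";

# An upgraded copy: same byte string, different internal storage.
my $copy = $encoded;
utf8::upgrade($copy);

# Correct usage: treat the string as the bytes it logically contains.
# utf8::downgrade restores native storage; it would fail only if the string
# held a codepoint above 255.
my $out = $copy;
utf8::downgrade($out);
my $correct_hex = unpack 'H*', $out;
print "correct:   $correct_hex\n";

# Incorrect usage: take the internal buffer at face value. Each of the two
# original bytes is stored as its own UTF-8 sequence, so the result looks
# double encoded.
my $internal = $copy;
utf8::encode($internal);
my $internal_hex = unpack 'H*', $internal;
print "internals: $internal_hex\n";
```

The first line printed is the original two bytes (c3a9); the second is the
four-byte form (c383c2a9) that shows up in output when something bypasses the
string model.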

- Dan

