perl.perl5.porters | Postings from April 2021

Re: Perl 7: Fix string leaks?

From:
Salvador Fandiño
Date:
April 2, 2021 08:10
Subject:
Re: Perl 7: Fix string leaks?
Message ID:
4e5f6ca3-4ed0-b643-0392-50e77fad1806@gmail.com
On 2/4/21 4:38, Ben Bullock wrote:
> On Fri, 2 Apr 2021 at 09:58, Dan Book <grinnz@gmail.com> wrote:
> 
>> The UTF8 bit does not constitute any guess, so it cannot be combined
>> with an explicitly set bit for this purpose. It indicates which
>> format the internal bytes are *definitely* in, which Perl is allowed
>> to change whenever needed and the user cannot depend on.
> 
> In this case, the UTF8 bit of the composed string is the result of a
> guess. There are two scalars concatenated together, one has the UTF8
> bit set, and one does not have it set. The output of the
> concatenation has the UTF8 bit set. That does indeed constitute a
> guess as to what is in the scalars. If the output did not have the
> UTF8 bit set, that would also constitute a guess. Either outcome is a
> guess. Perl does not have enough information to decide unambiguously
> if the composed string is meant to be UTF8 or not.


There is no guessing here. The UTF8 flag is just an implementation detail!

From a logical point of view, perl knows only about characters. Those 
characters are Unicode characters, but you can think of them as numbers 
in the range 0-1114111.
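A quick illustration of the character-number view, using chr and ord (a minimal sketch):

```perl
use strict;
use warnings;

# chr() turns a character number into a character; ord() goes back.
my $broccoli = chr(129382);        # the emoji discussed below
print ord($broccoli), "\n";        # prints 129382
print ord('A'), "\n";              # prints 65
```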

When Perl reads binary data, it reads a string with characters in the 
range 0-255. If you then append to that sequence the character with 
number 129382 (🥦), you get just another sequence of characters. There 
is no distinction between binary and text data, because none is necessary.
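That point can be shown directly: start from characters in the 0-255 range, append a character above 255, and the result is still just a sequence of character numbers (a sketch):

```perl
use strict;
use warnings;

# A "binary" string: every character number is below 256.
my $bytes = join '', map { chr } 0x00, 0x80, 0xFF;

# Appending character 129382 yields just another character sequence.
my $str = $bytes . chr(129382);

print length($str), "\n";             # prints 4
print ord(substr($str, 3, 1)), "\n";  # prints 129382
```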

Internally, as an optimization, and because all the characters have 
values under 256, perl encodes the initial string using one byte per 
character. When you add the new character with a value over 255, it 
switches the internal format to another one that can handle the full 
range. That the wide representation happens to be utf8 is unimportant; 
it doesn't mean the data has any relation to UTF-8 outside of the guts. 
I guess it was picked because it is an efficient way to represent data 
when most of the characters are ASCII. But it could have been UTF-16 
(Windows does that, for instance, and IIRC Java too) or UTF-32 (Python?).
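The storage switch can be observed through utf8::is_utf8, which reports the internal format without changing the string's meaning (a sketch; the flag's value is exactly the implementation detail you should not rely on):

```perl
use strict;
use warnings;

my $s = "caf\x{e9}";            # all character numbers below 256
my $t = $s . chr(129382);       # forces the wide internal format

# utf8::is_utf8 peeks at the internal flag only.
print utf8::is_utf8($t) ? "wide" : "byte", "\n";          # prints "wide"

# Despite the storage change, the original characters are untouched.
print substr($t, 0, 4) eq $s ? "equal" : "differ", "\n";  # prints "equal"
```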

That's the theory, at least. The problem we have is that there are some 
places where that internal representation leaks to the outside. But 
those are bugs in perl, or shortcomings of the design that call for 
some extra functionality.

One such bug: built-ins that access the file system, such as "symlink", 
"open", "chdir", "glob", etc., pass to the OS whatever string they find 
in the PV slot, without taking the internal format into consideration. 
Because the internal format depends both on the string's contents and 
on the history of the SV, sometimes the OS receives that argument as 
bytes (effectively Latin-1 encoded), sometimes as utf-8. It is 
completely unreliable. To work around it, the programmer has to take 
care to ensure that the right bytes are in the PV slot.
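As things stand, that workaround means encoding file names explicitly before they reach a syscall. A minimal sketch, assuming a UTF-8 file system encoding (adjust for your platform):

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $name = "caf\x{e9}.txt";               # non-ASCII file name

# Encode explicitly so the PV slot holds known bytes, whatever the
# string's internal format happens to be.
my $path = "$dir/" . encode('UTF-8', $name);

open my $fh, '>', $path or die "open: $!";
print {$fh} "hello\n";
close $fh or die "close: $!";

print -e $path ? "created" : "missing", "\n";   # prints "created"
```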

BTW, this issue was reported long ago: 
https://github.com/perl/perl5/issues/15883

But even if that were fixed, for instance by forcing one of the 
representations in the arguments to those functions (as Felipe's 
Sys::Binmode does), the programmer would still have to encode the file 
names explicitly in the right encoding. And that is the shortcoming I 
was talking about!

IMO, it is not acceptable to ask the programmer to wrap every argument 
to an OS call in "Encode::encode($os_encoding, $arg)" or some 
equivalent code. Perl needs to take care of this transparently, encoding 
data in the OS-configured encoding (UTF-16 on Windows, whatever LC_CTYPE 
says on Unix).


>> When you append an emoji to your bytestring, it forces it to be in
>> upgraded format, but the rest of the string contains the same bytes
>> it did before, even though they are now stored differently. You can
>> verify this by comparing the original string with a substring of the
>> modified string. Thus when you remove the emoji codepoint, it is
>> still the same string, regardless of the change in storage format.
> 
> But if Perl guessed the other way, and set the UTF8 flag to zero
> instead of one, everything about the string which you've said above
> would remain true:
> 
> no utf8;
> use File::Slurper qw!read_binary write_binary!;
> `wget -o /dev/null -O qr.png https://www.qrpng.org/qrpng.cgi`;
> my $png = read_binary ('qr.png');
> my $bpng = $png . '🥦';
> if (substr ($bpng, 0, length ($png)) eq $png) {
>      print "correct.\n";
> }
> write_binary ('qr-broccolli.png', $bpng);
> $bpng =~ s/🥦$//;
> write_binary ('qr-no-broccolli.png', $bpng);
> print `file *.png`;
> 
>> write_binary with a string that contains a codepoint over 255 is a
>> logic error, and you would receive a warning upon trying to do this
>> - that Perl dumps whatever is in its internal buffer instead in this
>> case is an implementation detail, and leads to things "accidentally
>> working" just enough to confuse people.
> 
> "The curious incident of the dog in the night-time."
> 
> The first call to write_binary gives a warning, but the second call to
> write_binary does not. Perl is using a heuristic ("guessing"), it
> doesn't have any information about the content of $bpng in either
> case.

There is no guessing here either. In the first case the data has a 
character with a value over 255, so perl warns that it cannot be saved 
correctly.

In the second case, that problematic value has been removed, and so all 
the remaining data can be written without issue.
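Both calls can be reproduced with a plain print to a file handle; the warning tracks the characters actually present in the string, not the internal flag (a sketch using a temp file):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($fh, $file) = tempfile(UNLINK => 1);
my $s = "abc" . chr(129382);

my $warnings = 0;
local $SIG{__WARN__} = sub { $warnings++ };

print {$fh} $s;         # warns: "Wide character in print"
$s =~ s/\x{1F966}\z//;  # drop the wide character; the flag may stay set
print {$fh} $s;         # no warning: every character fits in a byte
close $fh;

print $warnings, "\n";  # prints 1
```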




