Re: Perl 7: Fix string leaks?

From: Yuki Kimoto
Date: April 2, 2021 01:58
Subject: Re: Perl 7: Fix string leaks?
Message ID: CAExogxNJyrLUWRAju41vSpTJD9t4DSOOkC7b+XMNBw5_f8o-TA@mail.gmail.com

Dan

I've been thinking of it as follows. Am I wrong?

Downgraded format: Latin-1
Upgraded format: UTF-8

I also have a question.

What does it mean that the downgraded format is more efficient?

Is it about counting characters? Latin-1 is fast because the byte count is
the same as the character count. UTF-8 is slow for sequential access.
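
To make the two formats concrete, here is a minimal sketch (not taken from
the thread) that uses the core utf8:: functions to look at how one string is
stored. The sample string and the variable names are only illustrative, and
"use bytes" appears here purely as a debugging aid to peek at the internal
buffer; it is not something to rely on in real code.

```perl
use strict;
use warnings;

# Two characters, both below 256 (U+00E9 and U+00FF).
my $s = "\xE9\xFF";

# A literal like this starts out in the downgraded format:
# one byte per character.
print utf8::is_utf8($s) ? "upgraded\n" : "downgraded\n";

# Switch a copy to the upgraded format. Only the storage changes,
# not the string's value.
my $copy = $s;
utf8::upgrade($copy);
print utf8::is_utf8($copy) ? "upgraded\n" : "downgraded\n";

# Both copies are the same string as far as Perl code is concerned.
my $same = ($copy eq $s) ? 1 : 0;

# length() counts characters regardless of the storage format.
my $char_len = length($copy);

# Debugging only: the number of bytes in the internal buffer.
my $byte_len_down = do { use bytes; length($s) };
my $byte_len_up   = do { use bytes; length($copy) };

print "same=$same chars=$char_len ",
      "bytes(downgraded)=$byte_len_down bytes(upgraded)=$byte_len_up\n";
```

In the downgraded copy the internal byte count equals the character count;
in the upgraded copy each of these two characters takes two bytes, so Perl
has to walk the buffer to find a given character position.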

On Fri, Apr 2, 2021 at 9:58, Dan Book <grinnz@gmail.com> wrote:

> On Thu, Apr 1, 2021 at 8:51 PM Ben Bullock <benkasminbullock@gmail.com>
> wrote:
>
>> On Thu, 1 Apr 2021 at 21:18, Felipe Gasper <felipe@felipegasper.com>
>> wrote:
>>
>> > Appending a >255 code point to a string will always upgrade its
>> > storage. encode('Latin-1') seems to upgrade (which I think is
>> > weird).
>>
>> Perl seems to have been set up the way it is to deal with the
>> following sort of ambiguity:
>>
>> use utf8;
>> use File::Slurper qw!read_binary write_binary!;
>> `wget -o /dev/null -O qr.png https://www.qrpng.org/qrpng.cgi`;
>> my $png = read_binary ('qr.png');
>> $png .= '🥦';
>> write_binary ('qr-broccolli.png', $png);
>> $png =~ s/.$//;
>> write_binary ('qr-no-broccolli.png', $png);
>> print `file *.png`;
>>
>> The surprising part is that qr-no-broccolli.png actually gets written
>> correctly after qr-broccolli.png is mangled. But Perl is making an
>> assumption about arbitrary data which happens to work in the above
>> case.
>>
>> Am I correct in thinking that your "this is binary" data would
>> actually stop the "broccoli" from being added here and remove the
>> ambiguity?
>>
>> > I can’t think of any scenario where Perl itself would downgrade a
>> > string
>>
>> I can't either. I went through the source code of Perl trying to find
>> a place where it does that but couldn't find one.
>>
>> > though XS modules do all manner of funny business in that
>> > regard. The bigger point, though, is that because the behaviour here
>> > is unspecified, the burden of proof logically lies the other way:
>> > one should demonstrate that there are *no* places where perl
>> > downgrades a string, in default of which demonstration we must
>> > assume that Perl may, at any time and for any reason, downgrade a
>> > bytes-compatible string.
>>
>> That seems like a lot of work.
>>
>> > >
>> > >> If Perl could distinguish binary from text we could prevent
>> > >> that. (See my proposal earlier in this thread.)
>> > >
>> > > I only subscribed to this mailing list a short while ago and the web
>> > > server for the mailing list is out of action at the moment, was this
>> > > your proposal to add more bit flags? I really thought that was a very
>> > > good idea, in terms of solving the problem of dealing with the
>> > > utf8_downgrade problem, but I can't find the original post now.
>> >
>> > Yes, that was it. Re-pasted for convenience:
>>
>> Thank you.
>>
>> >
>> > -----
>> > It’s my understanding that there are unused bits in the SV. What if we
>> > used two of those to store an enum that records the decoded/encoded
>> > state, thus:
>> >
>> > enum sv_string_type {
>> >    SV_STRING_TYPE_UNKNOWN,
>> >    SV_STRING_TYPE_TEXT,     /* decoded */
>> >    SV_STRING_TYPE_BINARY,   /* encoded */
>> >    /* unused */
>> > };
>>
>> OK but why do you need two bits? Let's say you just have a "binary"
>> bit which stops it from being altered. So in my PNG example, when I
>> read the file and it has bytes >= 128 then it's marked as binary, then
>> it stops/warns at the .= '🥦' stage rather than at the stage of
>> writing the file.
>>
>> >
>> > … then some new core mechanism were aware of that enum and die()d if an
>> > attempt to double-encode or double-decode happened. So you’d have:
>> >
>> > my $str = <STDIN>;    # SV_STRING_TYPE_UNKNOWN by default, configurable.
>> >
>> > text::decode_utf8($str);   # sets SV_STRING_TYPE_TEXT
>> >
>> > text::decode_utf8($str);   # oops! die()s
>> >
>> > text::encode_utf8($str);   # sets SV_STRING_TYPE_BINARY
>> >
>> > text::encode_utf8($str);   # oops! die()s
>>
>> Double decoding and double encoding are artefacts of Perl trying to
>> guess whether you have binary data or not. In the broccoli example,
>> the second time it writes the PNG it correctly guesses the encoding,
>> but when it allows you to add on the emoji, it incorrectly guesses it
>> to be "LATIN-1" or whatever it thinks it is.
>>
>> I think this can all be prevented with a single bit, since you already
>> have the SvUTF8 flag. If SvUTF8 flag is the first bit here and "is
>> binary" is the second then:
>>
>> 00 - unknown text
>> 01 - binary, not text, don't upgrade
>> 10 - UTF-8 text
>> 11 - impossible state
>>
>> 00 -> 10 upgrade OK
>> 10 -> upgrade fails because "already upgraded"
>> 01 -> upgrade fails because binary
>> 11 -> impossible state
>>
>> 00 -> downgrade fails because "already downgraded"
>> 10 -> 00 downgrade OK
>> 01 -> downgrade fails because binary
>> 11 -> impossible state
>>
>> The "binary" bit would be set not only for actual binary data but
>> other things, for example an encoding scheme like EUC or CP932 where
>> the bytes represent text, so the name "binary" is slightly misleading,
>> it could be, for example, the "noupdowngrade" bit.
>>
>> > Obviously there’d be a lot of text::set()’ing going on for a
>> > while, but a) it would all be optional, and b) applications that
>> > already exercise proper “Eternal Vigilance” are well-positioned
>> > for this already. Applications that mishandle it already would, of
>> > course, have a meaningless sv_string_type -- which is no less than
>> > what they have now.
>>
>> Perl, without information about what the data actually consists of,
>> makes guesses, as we saw in the broccoli example above, and the
>> guesses sometimes turn out to be wrong. What I would suggest is that
>> Perl be made to stop guessing, simply by having a way to say
>> "this data cannot be up/downgraded". All that needs is one "binary"
>> bit which prevents it from happening.
>>
>
> There is some misunderstanding here. The UTF8 bit does not constitute any
> guess, so it cannot be combined with an explicitly set bit for this
> purpose. It indicates which format the internal bytes are *definitely* in.
> Perl is allowed to change that format whenever needed, and the user cannot
> depend on it.
>
> When you append an emoji to your bytestring, it forces it to be in
> upgraded format, but the rest of the string contains the same bytes it did
> before, even though they are now stored differently. You can verify this by
> comparing the original string with a substring of the modified string. Thus
> when you remove the emoji codepoint, it is still the same string,
> regardless of the change in storage format.
>
> write_binary with a string that contains a codepoint over 255 is a logic
> error, and you would receive a warning upon trying to do this. That Perl
> dumps whatever is in its internal buffer instead in this case is an
> implementation detail, and it leads to things "accidentally working" just
> enough to confuse people.
>
> -Dan
>
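
The check Dan describes (comparing the original string with a substring of
the modified one) can be sketched roughly as below. This is an illustration
only: a string holding every byte value stands in for the PNG contents, so
no download or file I/O is involved, and all variable names are made up for
the example. utf8::downgrade with a true second argument is used to ask
"would this fit in bytes?" without dying.

```perl
use strict;
use warnings;

# Stand-in for binary file contents: every byte value from 0x00 to 0xFF.
my $bytes = join '', map { chr($_) } 0 .. 255;

# Appending U+1F966 (the broccoli emoji) forces upgraded storage.
my $modified = $bytes . "\x{1F966}";

# The leading part of the modified string is still the original string,
# even though it is now stored differently.
my $prefix = substr($modified, 0, length $bytes);
print "prefix matches original\n" if $prefix eq $bytes;

# Remove the last character again (/s so that "." would also match "\n").
(my $trimmed = $modified) =~ s/.\z//s;
print "trimmed matches original\n" if $trimmed eq $bytes;

# Can each string be represented as bytes? Work on copies, because
# utf8::downgrade changes its argument in place on success.
my $modified_copy = $modified;
my $modified_fits = utf8::downgrade($modified_copy, 1) ? 1 : 0;
my $trimmed_copy  = $trimmed;
my $trimmed_fits  = utf8::downgrade($trimmed_copy, 1) ? 1 : 0;

print "modified fits in bytes: $modified_fits\n";
print "trimmed fits in bytes:  $trimmed_fits\n";
```

The string with the emoji cannot be downgraded, which matches the point
that writing it as binary is a logic error. Once the emoji is removed, the
string compares equal to the original and fits in bytes again, whatever
storage format Perl happens to be using for it at that moment.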
