Re: “strict” strings?
From: demerphq
Date: January 7, 2020 16:41
Subject: Re: “strict” strings?
Message ID: CANgJU+VVbsGupWuUtXb5+kZV-Ni75FTfu=5dXFURk0X8cn2SrA@mail.gmail.com
On Tue, 7 Jan 2020 at 02:59, Felipe Gasper <felipe@felipegasper.com> wrote:
>
>
> > On Jan 6, 2020, at 7:02 PM, Dan Book <grinnz@gmail.com> wrote:
> >
> > On Mon, Jan 6, 2020 at 11:07 AM Felipe Gasper <felipe@felipegasper.com> wrote:
> >
> > Is Sereal::Encoder wrong, then? It serializes Perl strings to a format that encodes binary and text as separate types, and the current implementation uses SVfUTF8 to make that distinction.
Just to repeat what I said earlier: the choice of BINARY for certain
text types in Sereal is merely an accident. By saying "BINARY" we
didn't mean "this is NOT text"; it actually means "this cannot be
assumed to be utf8 encoded".
I think people get confused by this subject because they have a broken
mental model of what "text" is. Text is just a series of numbers which
are given meaning by associating them with glyphs, and it is those
glyphs which have semantic meaning to humans.
In perl internals there are relatively few places that care about the
semantic meaning of these numbers, the predominant case being where
case transformations or case-insensitivity are implemented, e.g.
lc().
It is ONLY these places that give the UTF8 flag *any* sense of meaning
"text", and even then it is purely at the level of "when the utf8 flag
is ON, apply the case transformation rules specified by the Unicode
Consortium". When the UTF8 flag is off, our logic does NOT say "this
string is binary"; it says "the case transformation rules that apply
to this sequence of numbers are those specified by ASCII". An example
of the difference is that "ss" =~ /\x{DF}/i matches when the flag is
on and does not match when the flag is off.
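
A minimal sketch of that difference (\x{DF} is ß, whose Unicode case
fold is "ss"; the exact behavior can vary with perl version and
pragmas, so treat the comments as illustrative):

use strict;
use warnings;

my $pat = "\x{DF}";        # one byte, 0xDF (ß), UTF8 flag off
print "ss" =~ /$pat/i ? "match\n" : "no match\n";   # ASCII rules: typically no match
utf8::upgrade($pat);       # same character, UTF8 flag now on
print "ss" =~ /$pat/i ? "match\n" : "no match\n";   # Unicode folding: ß folds to "ss"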
But neither form actually means "this string contains text". Consider
the following code:
use Encode qw(encode_utf8);

my $packed = pack "N", 1113703327;   # four octets: 0x42 0x61 0xC3 0x9F
Encode::_utf8_on($packed);           # claim the buffer is utf8; nothing is validated
print encode_utf8($packed);          # prints those same octets: "Baß" on a UTF-8 terminal
Is $packed the "text" value "Baß" or is it the 32-bit big-endian
representation of 1113703327? And what about when we add this:
Encode::_utf8_off($packed);   # back to four raw octets
utf8::upgrade($packed);       # re-encodes them as utf8: 0x42 0x61 0xC3 0x83 0xC2 0x9F, flag on
print length($packed);        # prints 4: four characters in a six-octet buffer
After all this, is it text or binary? $packed *still* contains valid
utf8 sequences and the UTF8 flag is on, but now it contains a sequence
of octets that are actually the utf8 representation of the bytes used
in the utf8 representation of "Baß".
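
One way to see that double encoding for yourself (Devel::Peek ships
with core perl; the comments describe typical output):

use Devel::Peek;

Dump($packed);                    # FLAGS includes UTF8; the PV line shows the doubled octets
print utf8::is_utf8($packed) ? "flag on\n" : "flag off\n";
print length($packed), "\n";      # 4: perl counts characters, not the 6 octets in the buffer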
The point here is that the flag being off or on really says very
little about the semantic meaning of the contents. It says much more
about the mechanics needed to process the string, and about the rules
that should be applied if the buffer is fed to a function which is
"semantics aware"; but that a specific set of rules should be applied
says very little about the real meaning of the contents of a string.
I have seen people *deliberately* turn off the utf8 flag and then
modify the utf8 octets at the "binary" level in a s/// so that they
do not have to incur the penalty of treating the string as utf8.
Turning that flag off doesn't magically change the data from being
text to being binary, and turning it on doesn't magically change the
data from being binary to text, although turning it on is an
inherently dangerous operation: doing so inappropriately can make
Perl very unhappy.
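
A hedged sketch of that trick ($s and the substitution are made up
for illustration; the edit must leave the buffer as valid utf8, or
the final _utf8_on corrupts the string):

use Encode ();

my $s = "r\x{E9}sum\x{E9}";
utf8::upgrade($s);          # flag on; each "é" is stored as the octets 0xC3 0xA9
Encode::_utf8_off($s);      # same buffer, now treated as raw octets
$s =~ s/\xC3\xA9/e/g;       # edit the utf8 octets directly, no utf8-aware overhead
Encode::_utf8_on($s);       # safe only because the buffer is still valid utf8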
> >
> > This is likely just a convenience. Strings without the UTF8 flag set can always be stored as a binary string, even though they might not be one logically. Strings with the UTF8 flag may or may not be storable as a binary string. The alternative of storing everything as a text string introduces more false positives and a ton of overhead for the common ASCII or binary string case.
>
> But consider the case where Sereal is *not* used for Perl-to-Perl IPC … let’s say Perl-to-Go. In such a context it’s important (for Go’s sake) to distinguish blobs from text, so it’s more than a matter of convenience, right? And if the “wrong” type is sent--e.g., a binary string sent as UTF-8 encoded characters--the Go side has to know that and build in logic to handle it.
I don't really get what you mean by "blob" versus "text". To me they
are questions of semantic meaning which cannot be determined by code.
Sereal *trusts* the utf8 flag, in the sense that it does not validate
that a buffer with the utf8 flag on actually contains only valid utf8
sequences, and it uses a specific tag for such data so that on the
other end the other language can Do The Right Thing. For instance,
several languages use UTF-16 or UTF-32 internally. Such a language
might translate a utf8 sequence into the relevant UTF-16 or UTF-32
representation expected by the language. When sending data that
contains such strings, it would encode the strings as utf8 and Perl
would Do The Right Thing on its end. The BINARY format says "I make
no commitments as to how the data is encoded; do whatever is
appropriate on your end".
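
A minimal sketch of that tag selection, assuming Sereal::Encoder from
CPAN (the two strings below hold identical octets; only the flag
differs, and with it the tag a non-Perl decoder sees):

use Encode ();
use Sereal::Encoder qw(encode_sereal);

my $octets = "\xC3\xA9";        # two octets, UTF8 flag off => Sereal's BINARY tag
my $chars  = "\xC3\xA9";
Encode::_utf8_on($chars);       # identical buffer, flag on => Sereal's STR_UTF8 tag

my $doc_bin = encode_sereal($octets);   # a Go decoder would see a byte blob
my $doc_str = encode_sereal($chars);    # a Go decoder would see the text "é"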
> Sereal appears to intend to solve that problem by using the UTF8 flag. As per several who’ve commented on this topic, though, this behavior is “broken”, right? At least insofar as the serialization intends to be useful outside Perl … which Sereal does.
I wouldn't say that what Sereal does is intended to distinguish *text*
from *non-text*; what it does is distinguish data that should be
decoded using utf8 rules from data where the encoding is unspecified.
That it has two string types is purely to facilitate round-tripping a
utf8-on string, not to distinguish text from non-text.
>
> This is the same problem as reliable strings/numbers … but whereas that’s a fairly simple problem to solve (given most encoders’ behavior, anyway), commentary thus far on my suggestion indicates that Perl offers *no* reliable way to output reliable blobs vs. text as the Sereal and CBOR specifications envision. (Notwithstanding the contradiction in Sereal’s spec.) To me that seems a conspicuous “missing feature” … and, without intending to disrespect the wisdom of folks on this list, I still don’t feel like I’ve seen anything that invalidates my idea.
>
> > undecode()d == binary == !SVfUTF8
> > decode()d == character == SVfUTF8
> >
> > It seems like the language’s public documentation already tells people to write code in a way where SVfUTF8 can indeed be a reliable determinant between character or octet strings.
> >
> > Regardless of whether it can be interpreted that way, that's not what happens in reality, either by a ton of Perl code people have written since 5.6 or by Perl itself. Strings can be upgraded at any time and when that happens is undefined. Latin-1 text strings can be upgraded or downgraded depending on how the program is set up, and this context is not always explicit to the Perl program. In general, it's best to regard SVfUTF8 as an internal flag for XS code, not for Perl code.
>
> Under what I envision, code that works that way can continue to work that way, unconcerned with anything new/experimental. Code that “opts in” to the more stringent behavior would assume the responsibility of inspecting/sanitizing what it gets back from non-strictstrings modules. I *believe* that a legacy module that auto-adds (or auto-loses) a UTF8 flag could be altered to avoid that behavior without altering its contract with other preexisting code.
>
> Just to reiterate: my proposal is that all places that currently auto-manipulate SVfUTF8 would, under strictstrings, throw an exception instead. All transitions between SvUTF8() and !SvUTF8() would have to happen via an explicit decode/encode. This opt-in restriction would be scoped the same as “strict” and other such pragmas.
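
For reference, the explicit transitions that proposal envisions are
what Encode already spells today; a sketch, with the input made up
for illustration:

use Encode qw(decode encode);

my $octets = "caf\xC3\xA9";                                # raw bytes, e.g. off a socket
my $chars  = decode('UTF-8', $octets, Encode::FB_CROAK);   # explicit octets -> characters
# ... operate on $chars as text ...
my $out    = encode('UTF-8', $chars);                      # explicit characters -> octets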
This boat has sailed, IMO. It probably would have saved some heartache
if it had been done this way to start, but at this point I don't think
it can reasonably be changed.
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"