From: Felipe Gasper
January 7, 2020 01:59
Message ID: 881D6F00-4EAB-4A4D-BDA3-1527FD0AC3DB@felipegasper.com
> On Jan 6, 2020, at 7:02 PM, Dan Book <email@example.com> wrote:
> On Mon, Jan 6, 2020 at 11:07 AM Felipe Gasper <firstname.lastname@example.org> wrote:
> > Is Sereal::Encoder wrong, then? It serializes Perl strings to a format that encodes binary and text as separate types, and the current implementation uses SVf_UTF8 to make that distinction.
> This is likely just a convenience. Strings without the UTF8 flag set can always be stored as a binary string, even though they might not be logically. Strings with the UTF8 flag may or may not be storable as a binary string. The alternative of storing everything as a text string introduces more false positives and a ton of overhead for the common ASCII or binary string case.
But consider the case where Sereal is *not* used for Perl-to-Perl IPC … let’s say Perl-to-Go. In such a context it’s important (for Go’s sake) to distinguish blobs from text, so it’s more than a matter of convenience, right? And if the “wrong” type is sent--e.g., a binary string sent as UTF-8 encoded characters--the Go side has to know that and build in logic to handle it.
Sereal appears to intend to solve that problem by using the UTF8 flag. As per several who’ve commented on this topic, though, this behavior is “broken”, right? At least insofar as the serialization intends to be useful outside Perl … which Sereal does.
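To make the concern concrete, here is a minimal sketch using only core Perl. The `wire_type` helper is hypothetical: it stands in for any encoder that chooses between a binary and a text wire type by looking at the UTF8 flag alone, and it is not Sereal's actual code.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Sereal): picks a wire type
# purely from the string's internal UTF8 flag.
sub wire_type {
    my ($str) = @_;
    return utf8::is_utf8($str) ? 'text' : 'binary';
}

my $latin1 = "caf\xe9";          # four characters, all code points <= 0xFF
my $before = wire_type($latin1); # flag is off here

my $copy = $latin1;
utf8::upgrade($copy);            # changes internal storage, not the logical string
my $after = wire_type($copy);

print "logically equal: ", ($latin1 eq $copy ? "yes" : "no"), "\n";
print "before upgrade: $before\n";
print "after upgrade:  $after\n";
```

The two strings compare equal in Perl, yet a flag-based encoder would hand a non-Perl consumer two different types for them.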
This is the same problem as reliably distinguishing strings from numbers … but whereas that’s a fairly simple problem to solve (given most encoders’ behavior, anyway), commentary thus far on my suggestion indicates that Perl offers *no* reliable way to emit blobs vs. text as distinct types, as the Sereal and CBOR specifications envision. (Notwithstanding the contradiction in Sereal’s spec.) To me that seems a conspicuous “missing feature” … and, without intending to disrespect the wisdom of folks on this list, I still don’t feel like I’ve seen anything that invalidates my idea.
> > undecode()d == binary == !SVf_UTF8
> > decode()d == character == SVf_UTF8
> > It seems like the language’s public documentation already tells people to write code in a way where SVf_UTF8 can indeed be a reliable determinant between character and octet strings.
> Regardless of whether it can be interpreted that way, that's not what happens in reality, either by a ton of Perl code people have written since 5.6 or by Perl itself. Strings can be upgraded at any time, and when that happens is undefined. Latin-1 text strings can be upgraded or downgraded depending on how the program is set up, and this context is not always explicit to the Perl program. In general, it's best to regard SVf_UTF8 as an internal flag for XS code, not for Perl code.
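For readers following along, the kind of implicit upgrade described in the quoted text can be sketched with core Perl alone. The variable names are illustrative only; the point is that nothing in the code asks for a flag change, yet the concatenated result carries the flag.

```perl
use strict;
use warnings;

my $bytes = "\xff\xfe";        # intended as two octets; UTF8 flag off
my $text  = "\x{263A}";        # one wide character; UTF8 flag on

# No decode() call anywhere: Perl upgrades the octets' storage
# on its own so that the two operands can be joined.
my $joined = $bytes . $text;

printf "bytes flagged:  %s\n", utf8::is_utf8($bytes)  ? "yes" : "no";
printf "joined flagged: %s\n", utf8::is_utf8($joined) ? "yes" : "no";
printf "joined length:  %d\n", length $joined;
```

A serializer that trusts the flag would now treat the two original octets as text code points U+00FF and U+00FE.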
Under what I envision, code that works that way can continue to work that way, unconcerned with anything new/experimental. Code that “opts in” to the more stringent behavior would assume the responsibility of inspecting/sanitizing what it gets back from non-strictstrings modules. I *believe* that a legacy module that auto-adds (or auto-loses) a UTF8 flag could be altered to avoid that behavior without altering its contract with other preexisting code.
Just to reiterate: my proposal is that all places that currently auto-manipulate SVf_UTF8 would, under strictstrings, throw an exception instead. All transitions between SvUTF8() and !SvUTF8() would have to happen via an explicit decode/encode. This opt-in restriction would be scoped the same as “strict” and other such pragmas.
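As a rough illustration only, the discipline could look like the sketch below. `strict_concat` is a hypothetical runtime stand-in for a single operation; the actual proposal is a lexically scoped pragma that does not exist yet, and a real implementation would have to cover every place Perl changes the flag, not just concatenation. `Encode::decode` and `Encode::encode` are the existing explicit transitions.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for what the pragma might enforce on concatenation:
# refuse to join a flagged string with an unflagged one.
sub strict_concat {
    my ($left, $right) = @_;
    die "refusing to mix flagged and unflagged strings\n"
        if utf8::is_utf8($left) xor utf8::is_utf8($right);
    return $left . $right;
}

my $octets = "caf\xc3\xa9";                        # UTF-8 bytes, e.g. read from a socket
my $chars  = decode('UTF-8', $octets);             # explicit octets -> characters
my $smiley = decode('UTF-8', "\xe2\x98\xba");      # explicit octets -> characters

my $ok = strict_concat($chars, $smiley);           # both decoded: allowed

# Mixing decoded characters with raw octets is rejected instead of
# being silently upgraded.
my $mixed_error = eval { strict_concat($chars, $octets); 1 } ? '' : $@;

my $out = encode('UTF-8', $ok);                    # explicit characters -> octets

print "mixed concat error: $mixed_error";
print "output is octets: ", (utf8::is_utf8($out) ? "no" : "yes"), "\n";
```

Code that never opts in would be unaffected; only the caller of the hypothetical strict operation sees the exception.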
Thank you, everyone who’s offered their thoughts on this idea. I know encoding matters are a well-trodden topic. I’m going to see if I can put together a proof-of-concept as something a bit more concrete.