Re: “strict” strings?

From:
Tony Cook
Date:
January 7, 2020 02:43
Subject:
Re: “strict” strings?
Message ID:
20200107024335.GD5228@mars.tony.develop-help.com
On Mon, Jan 06, 2020 at 08:59:32PM -0500, Felipe Gasper wrote:
> 
> > On Jan 6, 2020, at 7:02 PM, Dan Book <grinnz@gmail.com> wrote:
> > 
> > On Mon, Jan 6, 2020 at 11:07 AM Felipe Gasper <felipe@felipegasper.com> wrote:
> > 
> > Is Sereal::Encoder wrong, then? It serializes Perl strings to a format that encodes binary and text as separate types, and the current implementation uses SVf_UTF8 to make that distinction.
> > 
> > This is likely just a convenience. Strings without the UTF8 flag set can always be stored as a binary string, even though they might not be logically. Strings with the UTF8 flag may or may not be storable as a binary string. The alternative of storing everything as a text string introduces more false positives and a ton of overhead for the common ASCII or binary string case.
> 
> But consider the case where Sereal is *not* used for Perl-to-Perl IPC … let’s say Perl-to-Go. In such a context it’s important (for Go’s sake) to distinguish blobs from text, so it’s more than a matter of convenience, right? And if the “wrong” type is sent--e.g., a binary string sent as UTF-8 encoded characters--the Go side has to know that and build in logic to handle it.
> 
> Sereal appears to intend to solve that problem by using the UTF8 flag. As per several who’ve commented on this topic, though, this behavior is “broken”, right? At least insofar as the serialization intends to be useful outside Perl … which Sereal does.
> 
> This is the same problem as reliable strings/numbers … but whereas that’s a fairly simple problem to solve (given most encoders’ behavior, anyway), commentary thus far on my suggestion indicates that Perl offers *no* reliable way to output reliable blobs vs. text as the Sereal and CBOR specifications envision. (Notwithstanding the contradiction in Sereal’s spec.) To me that seems a conspicuous “missing feature” … and, without intending to disrespect the wisdom of folks on this list, I still don’t feel like I’ve seen anything that invalidates my idea.

If Sereal converts an SV that has a PV with SVf_UTF8 off to a
binary-specific type in some other language, that is a bug in Sereal.
I haven't tried it.

You've already been told umpteen times that the SVf_UTF8 flag does not
distinguish between character and binary strings, at least in modern
perl usage.

That flag does not distinguish between characters, Latin-1 encoded
bytes, raw image data, compressed image data, unpacked LZW compression
codes, etc.
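
A minimal sketch of this point, assuming only the core utf8:: functions
(the variable names are illustrative): the same logical string can be
stored with the flag off or on, and raw bytes can carry the flag as well,
so the flag says nothing about whether a value is text or binary.

```perl
use strict;
use warnings;

# One logical string, two internal representations.
my $text_off = "caf\xE9";        # 4 characters, stored as 4 bytes, SVf_UTF8 off
my $text_on  = $text_off;
utf8::upgrade($text_on);         # same 4 characters, stored as UTF-8, SVf_UTF8 on

# Raw bytes (the first four bytes of a PNG signature) can carry the flag too.
my $blob_off = pack("C*", 0x89, 0x50, 0x4E, 0x47);
my $blob_on  = $blob_off;
utf8::upgrade($blob_on);         # still the same 4 byte values, now flagged

for my $pair ([text => $text_off, $text_on], [blob => $blob_off, $blob_on]) {
    my ($label, $before, $after) = @$pair;
    printf "%s: is_utf8 before=%d after=%d, eq=%d, length %d vs %d\n",
        $label,
        utf8::is_utf8($before) ? 1 : 0,
        utf8::is_utf8($after)  ? 1 : 0,
        $before eq $after      ? 1 : 0,
        length($before), length($after);
}
```

Both pairs compare equal with eq and have the same length; only the
internal storage differs, which is all the flag records.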

With older perls, before the unicode_strings feature, it did make a
difference, and that is now considered a bug.
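
As a rough illustration of that difference (a sketch, assuming perl 5.12
or later and no locale in effect; the variable names are made up for the
example), uc() on U+00E9 depends on the internal flag when
unicode_strings is off, and does not when it is on:

```perl
use strict;
use warnings;

my $off = "\xE9";        # U+00E9, SVf_UTF8 off
my $on  = "\xE9";
utf8::upgrade($on);      # same character, SVf_UTF8 on

my ($legacy_off, $legacy_on, $modern_off, $modern_on);
{
    no feature 'unicode_strings';    # the legacy behaviour
    $legacy_off = uc($off);          # unchanged: only ASCII is case-mapped
    $legacy_on  = uc($on);           # case-mapped to U+00C9
}
{
    use feature 'unicode_strings';   # character semantics regardless of flag
    $modern_off = uc($off);
    $modern_on  = uc($on);
}

printf "without unicode_strings: U+%04X vs U+%04X\n",
    ord($legacy_off), ord($legacy_on);
printf "with unicode_strings:    U+%04X vs U+%04X\n",
    ord($modern_off), ord($modern_on);
```

Under the feature the two results agree, which is the behaviour modern
code is expected to rely on.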

Any patches that attempt to misuse the SVf_UTF8 flag to distinguish
between character data and some type of binary data will be rejected.

Something closer to what rjbs suggested might be accepted, but the
types can't be distinguished by the SVf_UTF8 flag.

Of course any such extra checks will slow down any operations that
check for these types.

Tony
