Re: “strict” strings?

From: Tony Cook
Date: January 5, 2020 22:55
Subject: Re: “strict” strings?
Message ID: 20200105225534.GA5228@mars.tony.develop-help.com

On Sun, Jan 05, 2020 at 01:05:55PM -0500, Felipe Gasper wrote:
> 
> > On Jan 5, 2020, at 10:07 AM, Zefram <zefram@fysh.org> wrote:
> > 
> > Felipe Gasper wrote:
> >> Is there any supported text-decode operation, as per the
> >> input-decode-work-encode-output workflow described in `perlunitut`,
> >> that doesn't set that flag?
> > 
> > Yes.  Anyone who knows they're decoding Latin-1 is free to do it by *not*
> > calling any decoding function.  In general, anything that decodes to
> > any subset of the Latin-1 character repertoire is free to represent its
> > result in the internal Latin-1 encoding rather than the internal UTF-8.
> > Plenty of code that never generates non-Latin-1 characters does yield
> > downgraded results.
> 
> The workflow you’re describing--considering a non-decode as equivalent to decoding as Latin-1--violates the workflow that `perlunitut` prescribes. At least, I submit that *most* people who read `perlunitut` would think that document inconsistent with what you’re saying. Moreover, the “implicit” Latin-1 decode you describe mismatches how Perl implements an explicit Latin-1 decode (i.e., adds the UTF8 flag).

The UTF8 flag doesn't mark an SV (with PV) as a character string.  An SV
(with PV) without the UTF8 flag may be a character string.
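
A minimal sketch of that point, using only the always-available utf8::
functions.  The variable names and the sample string are just for
illustration: the same four-character string is stored once in each
internal encoding, and Perl treats the two copies as the same string.

```perl
use strict;
use warnings;

# "caf\x{e9}" is four characters, all within the Latin-1 repertoire.
my $downgraded = "caf\x{e9}";
my $upgraded   = $downgraded;

utf8::downgrade($downgraded);   # stored as Latin-1 bytes, UTF8 flag off
utf8::upgrade($upgraded);       # stored as UTF-8 bytes, UTF8 flag on

printf "downgraded: flag=%d length=%d\n",
    (utf8::is_utf8($downgraded) ? 1 : 0), length $downgraded;
printf "upgraded:   flag=%d length=%d\n",
    (utf8::is_utf8($upgraded) ? 1 : 0), length $upgraded;

# Same characters, so the strings compare equal despite the flag.
print "equal\n" if $downgraded eq $upgraded;
```

Both copies have length 4 and compare equal; only the internal
representation differs, so the flag alone can't tell you whether the
scalar was meant as text or as octets.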

> 
> What I propose (“strictstrings”) is an opt-in mode of operation where Perl no longer would attempt to interpret un-decode()d strings as Latin-1. Everything that handles strings as text would have to explicitly decode/encode. Perl would thus more naturally interact with Python, JavaScript, Sereal (see below), CBOR, WebSocket, and whatever other popular technologies nowadays distinguish explicitly between text and binary.
> 
> Like “use strict”, it wouldn’t break any existing code since it would be opt-in.

Well, first we'd need a separate byte or character string type.

> 
> > 
> >> then is Sereal wrong for attempting to distinguish text from binary,
> >> apparently using this flag?
> > 
> > As far as I can see, Sereal does not advertise that it makes any
> > distinction between character strings and octet strings.  If it did,
> > it would be wrong for it to base such a distinction on perl's SvUTF8
> > flag.  What Sereal does do is preserve the internal encoding of strings
> > across serialisation.  This is more conservative than is required to be
> > considered correct, but is not in any way wrong.
> 
> Sereal does indeed distinguish explicitly between character and octet strings:
> 
> https://github.com/Sereal/Sereal/blob/master/sereal_spec.pod
> 
> In this regard it’s consistent with the other such technologies that I mentioned above.

It calls the BYTES strings "binary/latin1 string", so to me it looks
like it only distinguishes on the encoding, not on whether it's some
special binary type.

Tony
