Re: “strict” strings?
From: Tony Cook
Date: January 5, 2020 22:55
Subject: Re: “strict” strings?
Message ID: 20200105225534.GA5228@mars.tony.develop-help.com
On Sun, Jan 05, 2020 at 01:05:55PM -0500, Felipe Gasper wrote:
>
> > On Jan 5, 2020, at 10:07 AM, Zefram <zefram@fysh.org> wrote:
> >
> > Felipe Gasper wrote:
> >> Is there any supported text-decode operation, as per the
> >> input-decode-work-encode-output workflow described in `perlunitut`,
> >> that doesn't set that flag?
> >
> > Yes. Anyone who knows they're decoding Latin-1 is free to do it by *not*
> > calling any decoding function. In general, anything that decodes to
> > any subset of the Latin-1 character repertoire is free to represent its
> > result in the internal Latin-1 encoding rather than the internal UTF-8.
> > Plenty of code that never generates non-Latin-1 characters does yield
> > downgraded results.
>
> The workflow you’re describing--treating a non-decode as equivalent to decoding as Latin-1--conflicts with the workflow that `perlunitut` prescribes. At least, I submit that *most* people who read `perlunitut` would consider that document inconsistent with what you’re saying. Moreover, the “implicit” Latin-1 decode you describe does not match how Perl implements an explicit Latin-1 decode (which adds the UTF8 flag).
The UTF8 flag doesn't mark an SV (with a PV) as a character string. An SV
(with a PV) without the UTF8 flag may still be a character string.
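
For concreteness, a minimal sketch of that distinction (the "café" octets
and the ISO-8859-1 label below are illustrative, not taken from the
thread): both variables hold the same character string, but only the
explicit Encode decode comes back with the UTF8 flag set.

    use strict;
    use warnings;
    use Encode qw(decode);

    # Octets known to be Latin-1 encoded text ("caf\xE9" is "café").
    my $octets = "caf\xE9";

    # Explicit decode via Encode: the result carries the internal UTF8 flag.
    my $explicit = decode('ISO-8859-1', $octets);

    # "Implicit" decode: treat the octets directly as characters, since
    # Latin-1 code points 0..255 coincide with Unicode code points 0..255.
    my $implicit = $octets;

    # Same character string either way ...
    print $explicit eq $implicit ? "same characters\n" : "different\n";

    # ... but only the explicit decode turned the UTF8 flag on.
    printf "explicit: UTF8 flag %s\n", utf8::is_utf8($explicit) ? "on" : "off";
    printf "implicit: UTF8 flag %s\n", utf8::is_utf8($implicit) ? "on" : "off";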
>
> What I propose (“strictstrings”) is an opt-in mode of operation in which Perl would no longer attempt to interpret un-decode()d strings as Latin-1. Everything that handles strings as text would have to decode/encode explicitly. Perl would thus interact more naturally with Python, JavaScript, Sereal (see below), CBOR, WebSocket, and whatever other popular technologies nowadays distinguish explicitly between text and binary.
>
> Like “use strict”, it wouldn’t break any existing code since it would be opt-in.
Well, first we'd need a separate byte or character string type.
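
For reference, a rough sketch of the decode-work-encode workflow that
`perlunitut` prescribes and that such an opt-in mode would presumably
enforce everywhere (the filenames, the UTF-8 choice, and the uppercasing
step are placeholders):

    use strict;
    use warnings;
    use Encode qw(decode encode);

    # Read raw octets, decode them to a character string, work on
    # characters, then encode back to octets before writing.
    open my $in, '<:raw', 'input.txt' or die "open input.txt: $!";
    my $octets_in = do { local $/; <$in> };

    my $text       = decode('UTF-8', $octets_in);   # input  -> characters
    $text          = uc $text;                      # work on characters
    my $octets_out = encode('UTF-8', $text);        # characters -> octets

    open my $out, '>:raw', 'output.txt' or die "open output.txt: $!";
    print {$out} $octets_out;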
>
> >
> >> then is Sereal wrong for attempting to distinguish text from binary,
> >> apparently using this flag?
> >
> > As far as I can see, Sereal does not advertise that it makes any
> > distinction between character strings and octet strings. If it did,
> > it would be wrong for it to base such a distinction on perl's SvUTF8
> > flag. What Sereal does do is preserve the internal encoding of strings
> > across serialisation. This is more conservative than is required to be
> > considered correct, but is not in any way wrong.
>
> Sereal does indeed distinguish explicitly between character and octet strings:
>
> https://github.com/Sereal/Sereal/blob/master/sereal_spec.pod
>
> In this regard it’s consistent with the other such technologies that I mentioned above.
It calls the BYTES strings "binary/latin1 string", so to me it looks
as though it distinguishes only on the encoding, not on whether the
value is some special binary type.
Tony
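
A rough sketch of the preservation behaviour described above, assuming
Sereal::Encoder and Sereal::Decoder are installed and reusing the same
"café" string in both internal encodings; if Sereal preserves the
internal encoding across serialisation, each element's UTF8 flag should
survive the round trip:

    use strict;
    use warnings;
    use Encode qw(decode);
    use Sereal::Encoder qw(encode_sereal);
    use Sereal::Decoder qw(decode_sereal);

    # Two scalars holding the same characters: one downgraded (no UTF8
    # flag, the case the Sereal spec calls a "binary/latin1 string") and
    # one upgraded (UTF8 flag on).
    my $downgraded = "caf\xE9";
    my $upgraded   = decode('ISO-8859-1', "caf\xE9");

    my $roundtrip = decode_sereal(encode_sereal([$downgraded, $upgraded]));

    # If the internal encoding is preserved, the flags come back exactly
    # as they went in.
    printf "downgraded: UTF8 flag %s\n", utf8::is_utf8($roundtrip->[0]) ? "on" : "off";
    printf "upgraded:   UTF8 flag %s\n", utf8::is_utf8($roundtrip->[1]) ? "on" : "off";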