
Re: “strict” strings?

Felipe Gasper
January 5, 2020 18:06

> On Jan 5, 2020, at 10:07 AM, Zefram wrote:
> Felipe Gasper wrote:
>> Is there any supported text-decode operation, as per the
>> input-decode-work-encode-output workflow described in `perlunitut`,
>> that doesn't set that flag?
> Yes.  Anyone who knows they're decoding Latin-1 is free to do it by *not*
> calling any decoding function.  In general, anything that decodes to
> any subset of the Latin-1 character repertoire is free to represent its
> result in the internal Latin-1 encoding rather than the internal UTF-8.
> Plenty of code that never generates non-Latin-1 characters does yield
> downgraded results.

Treating a non-decode as equivalent to a Latin-1 decode, as you describe, conflicts with the workflow that `perlunitut` prescribes. At least, I submit that *most* people who read `perlunitut` would find that document inconsistent with what you’re saying. Moreover, the “implicit” Latin-1 decode you describe does not match how Perl implements an explicit Latin-1 decode, which sets the UTF8 flag on the result.
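A minimal sketch of that mismatch, assuming the `Encode` module that ships with Perl (the variable names are only illustrative). The un-decoded string and the explicitly decoded one compare equal as character strings, yet only the second carries the UTF8 flag:

```perl
use strict;
use warnings;
use Encode ();

# One octet, 0xE9, as it might arrive from a file or socket.
my $octets = "\xE9";

# "Implicit" Latin-1: no decode call at all.
my $implicit = $octets;

# Explicit Latin-1 decode via Encode.
my $explicit = Encode::decode('ISO-8859-1', $octets);

# Both hold the single code point U+00E9, so they compare equal.
my $same_chars = ($implicit eq $explicit) ? 1 : 0;

# The internal UTF8 flag is where they differ.
my $implicit_flag = utf8::is_utf8($implicit) ? 1 : 0;
my $explicit_flag = utf8::is_utf8($explicit) ? 1 : 0;

printf "equal=%d implicit_flag=%d explicit_flag=%d\n",
    $same_chars, $implicit_flag, $explicit_flag;
```

Anything that inspects the flag, as a serializer might, will therefore see two different things for what Perl considers the same string.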

What I propose (“strictstrings”) is an opt-in mode of operation in which Perl would no longer attempt to interpret un-decode()d strings as Latin-1. Everything that handles strings as text would have to decode and encode explicitly. Perl would thus interact more naturally with Python, JavaScript, Sereal (see below), CBOR, WebSocket, and whatever other popular technologies nowadays distinguish explicitly between text and binary.

Like “use strict”, it wouldn’t break any existing code since it would be opt-in.
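For reference, here is a rough sketch of the explicit decode/work/encode workflow that such a mode would require everywhere. It uses only today's `Encode` module; no “strictstrings” pragma exists, and the sample data and variable names are made up for illustration:

```perl
use strict;
use warnings;
use Encode ();

# Input: the UTF-8 octets for "café", as received from outside the program.
my $input_octets = "caf\xC3\xA9";

# 1. Decode: octets in, characters out.
my $text = Encode::decode('UTF-8', $input_octets);

# 2. Work: operate on characters, not octets.
my $octet_count = length $input_octets;   # counts octets
my $char_count  = length $text;           # counts characters
my $upper       = uc $text;

# 3. Encode: characters in, octets out, just before output.
my $output_octets = Encode::encode('UTF-8', $upper);

printf "octets=%d chars=%d\n", $octet_count, $char_count;
```

The point of the proposal is that step 1 could no longer be skipped silently: text operations on a string that never went through a decode would stop being treated as Latin-1.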

>> then is Sereal wrong for attempting to distinguish text from binary,
>> apparently using this flag?
> As far as I can see, Sereal does not advertise that it makes any
> distinction between character strings and octet strings.  If it did,
> it would be wrong for it to base such a distinction on perl's SvUTF8
> flag.  What Sereal does do is preserve the internal encoding of strings
> across serialisation.  This is more conservative than is required to be
> considered correct, but is not in any way wrong.

Sereal does indeed distinguish explicitly between character and octet strings:

In this regard it’s consistent with the other such technologies that I mentioned above.

Under “strictstrings” mode, Perl would be able to behave like those technologies. I think it would simplify life for new Perl developers, who nowadays are likely more familiar with JS or Python, and reduce the frequency of “surprise” encoding problems for everyone.

