Re: “strict” strings?
From: Felipe Gasper
January 6, 2020 03:07
Message ID: 2E81FBE0-F6BD-41E0-90D6-96590E6CF037@felipegasper.com
> On Jan 5, 2020, at 7:27 PM, Zefram via perl5-porters <firstname.lastname@example.org> wrote:
> Felipe Gasper wrote:
>> The workflow you're describing--considering a non-decode as
>> equivalent to decoding as Latin-1--violates the workflow that
>> `perlunitut` prescribes.
> No, it doesn't, precisely because Perl doesn't distinguish between
> a string of octets and a string of Latin-1 characters.
As I wrote earlier, I daresay few, if any, who are new to Perl and read `perlunitut` would think in this way. The document specifically says in “I/O flow” that, if the input is not binary, “you should decode it”. It even shows an example decode() of Latin-1. I think nearly anyone who comes to this problem afresh would think the document means that strings encoded in Latin-1 should be explicitly decoded before being handled as text.
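To make the point of contention concrete, here is a minimal sketch comparing Latin-1 octets that were never decoded with the same octets passed through an explicit decode(), as in the `perlunitut` example. The variable names and the sample word are illustrative assumptions, not taken from `perlunitut`.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Octets as they might arrive from a Latin-1 encoded input: "café".
my $octets = "caf\xE9";

# The explicit decode step that perlunitut's "I/O flow" describes.
my $text = decode('ISO-8859-1', $octets);

# Both strings hold the same four code points, so they compare equal.
print "equal: ",  ($octets eq $text ? "yes" : "no"), "\n";
print "length: ", length($octets), " vs ", length($text), "\n";

# Any difference is confined to the internal representation,
# which is what the SvUTF8 flag records.
print "flag: ", (utf8::is_utf8($octets) ? 1 : 0),
      " vs ",   (utf8::is_utf8($text)   ? 1 : 0), "\n";
```

At the Perl level the two strings are indistinguishable by `eq` and `length`; only an introspection call such as `utf8::is_utf8` can tell them apart.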
If the intent truly is that skipping an explicit decode of Latin-1 encoded input is just as valid and encouraged a workflow as decoding explicitly, it would be nice if `perlunitut` were updated to make that clear from the get-go. I’d offer to do it, but I’m still not sure that my mental model of all of this is what’s intended.
> This doesn't
> only happen with input streams that are in their entirety Latin-1 or
> ASCII characters; it is also common for strings of such characters to
> be extracted or decoded from larger file formats without setting the
> SvUTF8 flag. It is also normal for string literals in Perl source to
> be treated as character strings without any explicit decoding phase,
> and those produce non-SvUTF8 strings wherever possible.
Most string operations work perfectly well on undecoded strings. I myself rarely use decode/encode unless I have to interact with JSON.
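A small sketch of what "work perfectly well" can mean in practice, assuming UTF-8 encoded input that is deliberately left undecoded. The sample data and variable names are hypothetical.

```perl
use strict;
use warnings;

# UTF-8 encoded octets for "naïve,café", left undecoded.
my $raw = "na\xC3\xAFve,caf\xC3\xA9";

# Splitting on an ASCII delimiter and joining again is safe on the octets.
my @fields   = split /,/, $raw;
my $rejoined = join ',', @fields;

# length() counts octets here, not characters: "café" is 5 octets.
my $octet_count = length $fields[1];

# After decoding, the same field is 4 characters long.
my $decoded = $fields[1];
utf8::decode($decoded);
my $char_count = length $decoded;

print "fields: ", scalar(@fields), "\n";
print "octets: $octet_count, characters: $char_count\n";
```

Operations keyed on ASCII bytes round-trip cleanly, while anything that counts or matches individual characters sees octets until the string is decoded.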
>> What I propose ("strictstrings") is an opt-in mode of operation
> It's not feasible to opt into this mode, because strings cross module
> boundaries all the time, and in all sorts of roles. Any type distinction
> attached to strings will be lost by innocuous operations performed by
> unaware modules, and the behaviour of modules with respect to the type
> distinction would quickly become an API backcompat issue preventing
> modules acquiring the type distinction. This is completely unlike "use
> strict", which affects the interpretation of bits of code that are by
> definition completely localised within a single module.
A fair amount of existing Perl would not work with “strictstrings” mode, to be sure. But since the proposed mode would merely introduce new failure states, nothing that _does_ work with it would break without it, so couldn’t any existing code be rectified?
The new failure states may also expose subtle encoding bugs in existing code.
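As an illustration of the kind of subtle bug meant here, the sketch below concatenates a decoded string with an undecoded one and then encodes the result. It runs silently today. Whether “strictstrings” would reject the concatenation depends on semantics that are not settled in this thread, so treat that part as an assumption.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# One string was decoded from UTF-8 input; the other never was.
my $decoded   = decode('UTF-8', "caf\xC3\xA9");   # 4 characters
my $undecoded = "na\xC3\xAFve";                   # 6 octets

# Perl joins them without complaint, treating each octet of
# $undecoded as a Latin-1 character.
my $mixed = "$decoded $undecoded";

# Encoding for output then encodes those octets a second time.
my $output   = encode('UTF-8', $mixed);
my $expected = "caf\xC3\xA9 na\xC3\xAFve";

print $output eq $expected ? "intact\n" : "corrupted\n";
print "length: ", length($output), " (expected ", length($expected), ")\n";
```

The output comes out two octets longer than intended because "ï" was encoded twice, and nothing in the program signals the mistake.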