
Re: "use v5.36.0" should imply ASCII source

From: Darren Duncan
Date: August 6, 2021 17:35
Subject: Re: "use v5.36.0" should imply ASCII source
Message ID: a8805a05-ab29-f918-4a2d-5e1bbfced289@darrenduncan.net

On 2021-08-06 8:22 a.m., Ricardo Signes wrote:
> At the PSC, we had a long talk about this, and another proposal was made:
> 
> We introduce a new stricture, which I'll call "source_encoding".  Under "use 
> strict 'source_encoding'", the compiler will raise an exception when the source 
> contains non-ASCII content unless the utf8 pragma is in effect.  The error 
> raised can drive the programmer to documentation explaining the various 
> trade-offs.  That is: you can turn on utf8 and deal with how this affects your 
> I/O, or you can disable the stricture, or you can restate your non-ASCII content 
> as ASCII by using escaping constructs.
> 
> I'm not /sure/ this is an improvement, but I think it is.  This prevents the "I 
> forgot to add utf8 and so only discovered after runtime that I have 
> doubly-encoded my output" bug.

+1

Personally I feel that this change is a great improvement, assuming I understand 
it right.

So just to be clear, when you say ASCII, you mean pure 7-bit ASCII, which is a 
proper subset of both UTF-8 and all the Latin encodings, and thus any source 
files written in that will "just work" in both the most common Unicode AND 
non-Unicode environments.
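
A small sketch of that subset property, written in Python purely for illustration; the sample source lines are made up. Bytes that are all below 0x80 decode to the same text under ASCII, UTF-8 and Latin-1, while a single non-ASCII character makes the readings diverge.

```python
# Pure 7-bit ASCII bytes decode to the same text under every
# ASCII-compatible encoding tried here.
ascii_source = b'print "hello";\n'
decoded = {enc: ascii_source.decode(enc) for enc in ("ascii", "utf-8", "latin-1")}
all_same = len(set(decoded.values())) == 1

# One non-ASCII character (U+00E9) encoded as UTF-8 becomes two octets,
# each with the 8th bit set; Latin-1 reads them as two other characters.
utf8_source = 'print "caf\u00e9";\n'.encode("utf-8")
as_utf8 = utf8_source.decode("utf-8")
as_latin1 = utf8_source.decode("latin-1")
high_octets = [b for b in utf8_source if b >= 0x80]
```

Here `all_same` is true, whereas `as_utf8` and `as_latin1` differ, which is the kind of mismatch that leads to doubly-encoded output.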

Would your new stricture, turned on as part of "use v5.36", then fail every 
source file that has any octet with the 8th bit set when that file doesn't also 
have an explicit declaration of source encoding?

Because that is what I would expect given what you said.
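
The check being asked about can be approximated outside of perl. What follows is only a rough sketch under stated assumptions, not how the compiler would implement the stricture: the function names are hypothetical, and "utf8 is in effect" is reduced to a line-based search for `use utf8`, ignoring Perl's lexical scoping and `no utf8`.

```python
import re

def first_high_octet(source: bytes):
    """Return (line, column) of the first octet with the 8th bit set, or None."""
    for line_no, line in enumerate(source.split(b"\n"), start=1):
        for col, octet in enumerate(line, start=1):
            if octet & 0x80:
                return (line_no, col)
    return None

# Crude stand-in for "the utf8 pragma is in effect": any line that starts
# with "use utf8". Real Perl scoping is lexical; this ignores that entirely.
USE_UTF8 = re.compile(rb"^\s*use\s+utf8\b", re.MULTILINE)

def would_fail(source: bytes) -> bool:
    """True if the file has a high-bit octet and no 'use utf8' line."""
    return first_high_octet(source) is not None and not USE_UTF8.search(source)
```

Under these assumptions a pure-ASCII file passes, a file with a UTF-8 encoded literal and no `use utf8` fails, and adding `use utf8;` makes it pass again.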

For my part, I expressly designed my portable data format MUON 
https://github.com/muldis/Muldis_Object_Notation/blob/master/spec/Muldis_Object_Notation_Syntax_Plain_Text.md 
so that characters outside the 7-bit ASCII repertoire are forbidden from 
appearing literally in a file except within quoted character string literals. 
One can therefore parse everything outside the quoted strings, the actual 
document structure, without even having to know what the encoding is (it can be 
done in binary mode), at least between UTF-8 vs Latin etc (and even for 
encodings that aren't), and decoding the inside of strings is deferrable.
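
The idea of parsing structure in binary mode and deferring string decoding can be sketched as follows. This is not MUON's actual grammar: it is a hypothetical toy format with double-quoted strings and backslash escapes, and it assumes an ASCII-compatible encoding in which the quote and backslash octets never occur inside a multi-byte character, which holds for UTF-8 and the Latin encodings.

```python
def split_structure(doc: bytes):
    """Split doc into ('code', bytes) and ('string', bytes) pieces, undecoded.

    Toy grammar (an assumption, not MUON): strings are double-quoted, with
    backslash as the escape octet. Raises ValueError on a high-bit octet
    outside a string, or on an unterminated string.
    """
    pieces = []
    i, n = 0, len(doc)
    while i < n:
        if doc[i] == 0x22:  # opening double quote
            j = i + 1
            while j < n and doc[j] != 0x22:
                j += 2 if doc[j] == 0x5C else 1  # skip the escaped octet
            if j >= n:
                raise ValueError("unterminated string literal")
            pieces.append(("string", doc[i + 1:j]))
            i = j + 1
        else:
            j = i
            while j < n and doc[j] != 0x22:
                if doc[j] & 0x80:
                    raise ValueError(f"non-ASCII octet outside string at offset {j}")
                j += 1
            pieces.append(("code", doc[i:j]))
            i = j
    return pieces
```

The same made-up document saved as UTF-8 or as Latin-1 yields identical `code` pieces; only the `string` payloads differ, and each can be decoded later once the encoding is known.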

-- Darren Duncan
