perl.perl5.porters

Re: "use v5.36.0" should imply ASCII source

Graham Knop
August 16, 2021 12:00
On Fri, Aug 6, 2021 at 5:23 PM Ricardo Signes wrote:
> Porters,
> I recently posted the suggestion that "use v5.36.0" should imply "use utf8", which led to a pretty large thread in which Felipe Gasper repeatedly said "This is going to make things worse, not better."  I spent a lot of time grumbling about this to myself, figuring out exactly how to rebut this, and then deciding that I tentatively, partly, agreed with him.
> We want each improvement to be a ratcheting up in language usability, when possible, rather than "we made things worse so we could make them better."  At present, because we don't (and can't) know whether a string is text or bytes, we don't (and can't) automatically encode it when it hits a bytestream.  We also don't know reliably whether a given output handle is already expecting to do that encoding for us.
> I am 100% certain that adding "use utf8" to the feature bundle would be better for me, but I already have a pretty strong grasp of the I/O model of Perl.  I'm not sure it's better enough for everybody.
> At the PSC, we had a long talk about this, and another proposal was made:
> We introduce a new stricture, which I'll call "source_encoding".  Under "use strict 'source_encoding'", the compiler will raise an exception when the source contains non-ASCII content unless the utf8 pragma is in effect.  The error raised can drive the programmer to documentation explaining the various trade-offs.  That is: you can turn on utf8 and deal with how this affects your I/O, or you can disable the stricture, or you can restate your non-ASCII content as ASCII by using escaping constructs.
> I'm not sure this is an improvement, but I think it is.  This prevents the "I forgot to add utf8 and so only discovered after runtime that I have doubly-encoded my output" bug.
> --
> rjbs
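
The double-encoding bug described in the quoted message can be
reproduced with a short sketch. It assumes the file is saved as UTF-8
and that "use utf8" was forgotten. The helper name bytes_written is
made up for illustration, and the in-memory handle stands in for
whatever encoded output handle a real program would use.

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8 and "use utf8" is NOT in effect,
# so the literal below is read as two Latin-1 characters (0xC3 0xA9).
my $name = "é";

# Hypothetical helper: write a string through a UTF-8 encoding layer into
# an in-memory buffer and return the raw bytes that came out.
sub bytes_written {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
    print {$fh} $string;
    close $fh or die "close failed: $!";
    return $buffer;
}

my $out = bytes_written($name);

# Each of the two source bytes has been encoded again, giving four bytes.
printf "%d bytes: %s\n", length($out),
    join ' ', map { sprintf '%02X', ord } split //, $out;
```

With "use utf8" in effect the same literal would be a single character
and the handle would emit the expected two bytes, which is why the
mistake tends to show up only at runtime.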

After thinking about this again, I had another idea.

Implying 'use utf8' is a problem because of the impact it has on
string semantics. Maybe we can just have it not affect string
semantics: make 'use v5.36.0;' decode the source as UTF-8, but
store string literals as byte strings rather than character strings.
The literals would still be required to be valid UTF-8, but would be
stored with the utf8 flag off. This would allow UTF-8 encoded
content in comments, Pod, or even function names, without creating
the confusion around strings and I/O.
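
To make the distinction concrete, here is a rough sketch of how a
literal looks today with and without "use utf8", and what the idea
above would presumably produce. It assumes the file is saved as UTF-8.
The $proposed value is only an emulation using utf8::encode, not
behaviour that any released perl implements.

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8.
my $as_chars = do { use utf8; "é" };   # with "use utf8": one character, U+00E9
my $as_bytes = do { no utf8;  "é" };   # without it: two bytes, 0xC3 0xA9

# Emulation of the idea: the source is decoded as UTF-8, but the literal is
# stored as its UTF-8 bytes with the utf8 flag off.
my $proposed = $as_chars;
utf8::encode($proposed);

printf "use utf8: length %d\n", length $as_chars;
printf "no utf8:  length %d\n", length $as_bytes;
printf "proposed matches the byte form: %s\n",
    $proposed eq $as_bytes ? 'yes' : 'no';
```

In this emulation the proposed literal is indistinguishable from what a
UTF-8 source file without "use utf8" yields today. The difference would
be in the parser, which would reject source that is not valid UTF-8 and
accept non-ASCII identifiers.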

This seems possibly hard to document, which may indicate that it is a
terrible idea.
