perl.perl5.porters | Postings from October 2021

Re: "use v5.36.0" should imply ASCII source

Ricardo Signes
October 3, 2021 18:56
On Mon, Aug 16, 2021, at 8:00 AM, Graham Knop wrote:
> On Fri, Aug 6, 2021 at 5:23 PM Ricardo Signes wrote:
>> At the PSC, we had a long talk about this, and another proposal was made:
>> We introduce a new stricture, which I'll call "source_encoding".  Under "use strict 'source_encoding'", the compiler will raise an exception when the source contains non-ASCII content unless the utf8 pragma is in effect.  The error raised can drive the programmer to documentation explaining the various trade-offs.  That is: you can turn on utf8 and deal with how this affects your I/O, or you can disable the stricture, or you can restate your non-ASCII content as ASCII by using escaping constructs.
> After thinking about this again, I had another idea.
> The reason implying 'use utf8' is a problem is because of the impact it has on string semantics. Maybe we can just have it not impact string semantics. Make 'use v5.36.0;' decode the source as UTF-8, but store string literals as byte strings rather than characters. The strings would still be required to be UTF-8 encoded, but would be stored with the utf8 flag off. This would allow using UTF-8 encoded content in comments, Pod, or even in function names, but would not create the confusion with strings and IO.

I said I'd write a reply to this and I didn't.  *Mea culpa*.

I think there are two big questions here:

*ONE:*  What's the end state we'd like to get to?

*TWO:*  What's a good next step, keeping in mind that we might not ever get past that next step?

My take is this:  The end state I'd like is that strings are in one of three states:  declared text, declared bytes, unknown.  Semantics exist for how to combine these and deal with I/O discipline.  The source code is Unicode and string literals are assumed to be text.  A new string literal syntax exists for byte strings, like `qb"..."`.  

For my money, a useful next step is that we encourage people to opt in to "source code is Unicode and string literals are text."  This means that the programmer is then responsible for thinking about how this will affect their I/O.  That concern is already there; we're just pushing the complexity around like a lump under the rug.  I think this push is a good one.  It lets us enable non-ASCII syntax, and it's pretty well understood.  Also, we already have something for `qb"..."` in the form of `do { use bytes; qq{...} }`, but we could probably add a qb, too, if we needed it.
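
A minimal sketch of what opting in changes for a string literal, assuming the file is saved as UTF-8.  The variable names are illustrative only, and the last step shows the I/O responsibility mentioned above: text has to be encoded before it goes to a byte-oriented handle.

```perl
use strict;
use warnings;

# Assumption: this source file is saved as UTF-8.

# Without the utf8 pragma, the literal is the two raw bytes 0xC3 0xA9.
my $bytes = do { no utf8; "é" };

# With the utf8 pragma, the same literal is one character, U+00E9.
my $text = do { use utf8; "é" };

print length($bytes), "\n";   # 2
print length($text),  "\n";   # 1

# Text must be encoded back to bytes before it is written out.
my $encoded = $text;
utf8::encode($encoded);       # in-place conversion to UTF-8 bytes
print "encoded text matches the raw bytes\n" if $encoded eq $bytes;
```

The two literals look identical in the source, which is why the choice of pragma has to be a deliberate one.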
