perl.perl5.porters | Postings from October 2021

Re: "use v5.36.0" should imply ASCII source

Dan Book
October 3, 2021 19:47
On Sun, Oct 3, 2021 at 2:57 PM Ricardo Signes wrote:

> On Mon, Aug 16, 2021, at 8:00 AM, Graham Knop wrote:
> > On Fri, Aug 6, 2021 at 5:23 PM Ricardo Signes wrote:
> > > At the PSC, we had a long talk about this, and another proposal was
> > > made: We introduce a new stricture, which I'll call
> > > "source_encoding".  Under "use strict 'source_encoding'", the
> > > compiler will raise an exception when the source contains non-ASCII
> > > content unless the utf8 pragma is in effect.  The error raised can
> > > drive the programmer to documentation explaining the various
> > > trade-offs.  That is: you can turn on utf8 and deal with how this
> > > affects your I/O, or you can disable the stricture, or you can
> > > restate your non-ASCII content as ASCII by using escaping constructs.
> >
> > After thinking about this again, I had another idea.
> >
> > The reason implying 'use utf8' is a problem is because of the impact
> > it has on string semantics. Maybe we can just have it not impact
> > string semantics. Make 'use v5.36.0;' decode the source as UTF-8, but
> > store string literals as byte strings rather than characters. The
> > strings would still be required to be UTF-8 encoded, but would be
> > stored with the utf8 flag off. This would allow using UTF-8 encoded
> > content in comments, Pod, or even in function names, but would not
> > create the confusion with strings and IO.
>
> I said I'd write a reply to this and I didn't.  *Mea culpa*.
>
> I think there are two big questions, here:
>
> *ONE:*  What's the end state we'd like to get to?
> *TWO:*  What's a good next step, keeping in mind that we might not ever
> get past that next step?
>
> My take is this:  The end state I'd like is that strings are in one of
> three states:  declared text, declared bytes, unknown.  Semantics exist for
> how to combine these and deal with I/O discipline.  The source code is
> Unicode and string literals are assumed to be text.  A new string literal
> syntax exists for byte strings, like qb"...".
>
> For my money, a useful next step is that we encourage people to opt-in to
> "source code is unicode and string literals are text."  This means that the
> programmer is then responsible for thinking about how this will affect
> their I/O.  That concern is already there, we're just pushing around the
> complexity like a lump under the rug.  I think this push is a good one.  It
> lets us enable non-ASCII syntax, and it's pretty well understood.  Also, we
> already have something for qb"...." in the form of "do { use bytes; qq{...}
> }" but we could probably add a qb, too, if we needed it.

"use bytes" is an abstraction breakage, not an interface, so I would prefer
the qb alternative, unless and until "use bytes" did nothing other than
what "no utf8" currently does (but that could be an alternative for your
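
As an illustration of the "abstraction breakage" concern, here is a minimal sketch, assuming a stock Perl 5 with no extra modules. The helper name internal_length is hypothetical and used only for this example; utf8::upgrade and the bytes pragma are core Perl. It takes one character, stores it two ways internally, and shows that ordinary string operations cannot tell the copies apart while code running under "use bytes" can.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): the length of a string's
# internal buffer, which is what "use bytes" exposes.
sub internal_length {
    my ($string) = @_;
    use bytes;
    return length $string;
}

# One character, U+00E9, written as an escape so this file stays ASCII.
my $native = "\xE9";

# The same character after Perl switches its internal storage to UTF-8.
my $upgraded = $native;
utf8::upgrade($upgraded);

# Ordinary string operations treat the two values as identical.
printf "equal: %s\n", ($native eq $upgraded ? "yes" : "no");
printf "length: %d and %d\n", length($native), length($upgraded);

# Under "use bytes" they differ, because the internal storage leaks through.
printf "internal length: %d and %d\n",
    internal_length($native), internal_length($upgraded);
```

On an ASCII-based platform the two strings compare equal and both have length 1, yet the internal lengths come out as 1 and 2. That dependence on storage, rather than on the string's value, is one way to read the objection to treating "use bytes" as an interface for byte-string literals.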

I agree very much with the end state proposed. I like the proposed next
step but I don't know how we get there. Even spreading understanding of the
current semantics is an uphill battle; too many people just don't
understand encoding, and that has to be baked into our approach. I think it
is possible, but not easy, to sufficiently document a new assumption for
whatever shape this feature may take. It's problematic that making
"assumption failures" reliably obvious when they occur is difficult to
impossible, ironically the sort of problem we are trying to fix here. I
don't have a conclusion here except that the most useful option won't
necessarily be the most expected (nor is the current state).

