On Mon, Aug 16, 2021, at 8:00 AM, Graham Knop wrote:
> On Fri, Aug 6, 2021 at 5:23 PM Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>> At the PSC, we had a long talk about this, and another proposal was made:
>>
>> We introduce a new stricture, which I'll call "source_encoding". Under "use strict 'source_encoding'", the compiler will raise an exception when the source contains non-ASCII content unless the utf8 pragma is in effect. The error raised can drive the programmer to documentation explaining the various trade-offs. That is: you can turn on utf8 and deal with how this affects your I/O, or you can disable the stricture, or you can restate your non-ASCII content as ASCII by using escaping constructs.
>
> After thinking about this again, I had another idea.
>
> The reason implying 'use utf8' is a problem is because of the impact it has on string semantics. Maybe we can just have it not impact string semantics. Make 'use v5.36.0;' decode the source as UTF-8, but store string literals as byte strings rather than characters. The strings would still be required to be UTF-8 encoded, but would be stored with the utf8 flag off. This would allow using UTF-8 encoded content in comments, Pod, or even in function names, but would not create the confusion with strings and IO.

I said I'd write a reply to this and I didn't. *Mea culpa*.

I think there are two big questions, here:

*ONE:* What's the end state we'd like to get to?

*TWO:* What's a good next step, keeping in mind that we might not ever get past that next step?

My take is this:

The end state I'd like is that strings are in one of three states: declared text, declared bytes, unknown. Semantics exist for how to combine these and deal with I/O discipline. The source code is Unicode and string literals are assumed to be text. A new string literal syntax exists for byte strings, like `qb"..."`.

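To make the text/bytes distinction concrete, here is a minimal sketch using only what Perl ships today (the utf8 pragma and the core Encode module). It is not the proposed design: the variable names and the `octets_written` helper are made up for illustration, and the file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                 # source is UTF-8, so literals are decoded to text
use Encode qw(encode);

# Text: one element per character.
my $text = "café";

# Bytes: the same content as explicit UTF-8 octets.
my $bytes = encode('UTF-8', $text);

# Hypothetical helper: write a text string through an encoding layer
# to an in-memory handle and report how many octets came out.
sub octets_written {
    my ($string) = @_;
    my $buffer = '';
    open(my $fh, '>:encoding(UTF-8)', \$buffer)
        or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close($fh) or die "cannot close in-memory handle: $!";
    return length $buffer;
}

printf "text:  %d characters\n", length $text;
printf "bytes: %d octets\n",     length $bytes;
printf "written through :encoding(UTF-8): %d octets\n",
    octets_written($text);
```

The text string has four characters while its encoded form has five octets, and the octets only appear once an I/O layer (or an explicit `encode`) has been chosen. That choice is the part left to the programmer once literals are treated as text.
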
For my money, a useful next step is that we encourage people to opt in to "source code is Unicode and string literals are text." This means that the programmer is then responsible for thinking about how this will affect their I/O. That concern is already there; we're just pushing around the complexity like a lump under the rug. I think this push is a good one. It lets us enable non-ASCII syntax, and it's pretty well understood.

Also, we already have something for `qb"..."` in the form of `do { use bytes; qq{...} }`, but we could probably add a qb, too, if we needed it.

-- 
rjbs