Re: "use v5.36.0" should imply UTF-8 encoded source

Felipe Gasper
July 31, 2021 20:16

> On Jul 31, 2021, at 4:16 AM, Yuki Kimoto <> wrote:
> 2021-7-31 16:17 Darren Duncan <>:
> On 2021-07-30 11:15 p.m., Yuki Kimoto wrote:
> > 2021-7-30 23:46 Ricardo Signes wrote:
> >     I propose that "use v5.36.0" should imply that the source code is,
> >     subsequently, UTF-8 encoded.
> > 
> >   At least after v5.38+.
> > 
> > It is good to change one by one.
> > 
> > I want to see the effect and hear the user experience of "use warnings" in the 
> > next release.
> I strongly disagree.  The warnings and utf8 are unrelated features.  These are 
> each also minor changes considering they are lexical.  Perl interpreter 
> development is already moving at a relatively glacial pace, there is no benefit 
> and a lot of downside of delaying the utf8 for a year just to see what people 
> say after a production with warnings is released.  The 5.36 is still about 9 
> months away, that is plenty of time for people to give feedback on either that 
> or the warnings.

Turning on warnings in the feature bundle will break things that worked under prior feature bundles, but the breakage will be visible and obvious.

Adding an auto-UTF-8-decode to all source text is a much more subtle breakage, and thus much more prone to confuse people. It’s basically the same type of change as making “my $foo = 123” parse the “123” in hex rather than decimal.

The proposal here is basically for “modern Perl” to make strings in the source code unable to be output as they are (that is, byte for byte, without an explicit encode step). It seems *awfully* likely to confuse people. Even setting that aside, in, e.g., JavaScript or Python the interpreter could at least tell you, “hey, you’re trying to print a character string, and I don’t know what encoding you want.” Or, “whoa, that’s a byte string, and this output stream encodes to UTF-8.” Perl has no way of doing that.
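For readers who have not seen the Python behavior referred to above, here is a minimal sketch of it. It uses only the standard `io` module; the helper `try_write` and the `results` dictionary are names invented for this illustration.

```python
import io


def try_write(stream, value):
    """Attempt a write; report 'ok' or the TypeError message."""
    try:
        stream.write(value)
        return "ok"
    except TypeError as exc:
        return f"TypeError: {exc}"


# A binary stream accepts only bytes; a text stream accepts only str.
binary_out = io.BytesIO()
text_out = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")

results = {
    "str -> binary stream": try_write(binary_out, "caf\u00e9"),
    "bytes -> binary stream": try_write(binary_out, "caf\u00e9".encode("utf-8")),
    "bytes -> text stream": try_write(text_out, b"caf\xc3\xa9"),
    "str -> text stream": try_write(text_out, "caf\u00e9"),
}

for case, outcome in results.items():
    print(f"{case}: {outcome}")
```

The two mismatched cases fail at the point of the write with a `TypeError`, which is the kind of pointed diagnostic Perl cannot give, since it has a single string type.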

Perl’s status quo is that all inputs are byte strings, and all outputs are byte strings. This is simple and consistent: until an application willingly interacts with something that needs or gives text strings (e.g., JSON), everything works similarly to “classic” C strings.

When we start worrying about “text”, though, confusion abounds: Perl can’t tell you when you’ve got the wrong “type”, and the language itself doesn’t even implement its own internal abstraction consistently (see CPAN’s Sys::Binmode). And how many interfaces out there neglect to document whether they expect/give encoded/decoded strings? Making “modern Perl” aggravate that further by defaulting to disparate encoding levels--inputs from the source will need encoding to be printed, but inputs from STDIN won’t … ?!?--will add even more “landmines”.
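The “disparate encoding levels” problem can be modeled roughly as follows. This is a Python simulation, not Perl: it assumes a Perl string can be approximated as a sequence of code points, with Latin-1 standing in for “one byte per code point, no decoding”. All function names are hypothetical.

```python
def decode_source_literal(raw: bytes) -> str:
    """Model of the proposal: source bytes are decoded as UTF-8."""
    return raw.decode("utf-8")


def read_stdin(raw: bytes) -> str:
    """Model of undecoded input: one code point per byte."""
    return raw.decode("latin-1")


def print_raw(s: str) -> bytes:
    """Model of output with no encoding layer: one byte per code point.
    Assumes every code point is <= 0xFF."""
    return s.encode("latin-1")


def print_utf8(s: str) -> bytes:
    """Model of output that encodes to UTF-8 first."""
    return s.encode("utf-8")


E_ACUTE = b"\xc3\xa9"  # the UTF-8 bytes for U+00E9 (e with acute accent)

from_source = decode_source_literal(E_ACUTE)  # one code point: U+00E9
from_stdin = read_stdin(E_ACUTE)              # two code points: U+00C3, U+00A9

outcomes = {
    ("source", "raw"): print_raw(from_source),    # lone 0xE9: not valid UTF-8
    ("source", "utf8"): print_utf8(from_source),  # correct
    ("stdin", "raw"): print_raw(from_stdin),      # correct
    ("stdin", "utf8"): print_utf8(from_stdin),    # double-encoded
}

for (origin, mode), data in outcomes.items():
    print(f"{origin:6} via {mode:4}: {data!r}")
```

In this model no single output path is right for both strings: printing raw mangles the source literal, and encoding first double-encodes the STDIN data. Nothing in the strings themselves records which treatment each one needs.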

Decoding source code as UTF-8 makes tons of sense, but only *after* the critical first step is taken of teaching Perl to distinguish text from bytes. (I have ideas for how to achieve that, if there are folks here interested in discussing it further.) That way we can change the “modern” default, and, as with warnings, breakages will come with useful error messages that point to where the problem is and how to fix it.

As a side note, this will facilitate other, hugely useful improvements like making it practical to use Windows’s Unicode APIs, preventing double-encode/decode, etc.

Thanks to all who’ve read and considered.

