develooper Front page | perl.perl5.porters | Postings from July 2021

Re: "use v5.36.0" should imply UTF-8 encoded source

Felipe Gasper
July 30, 2021 17:56

> On Jul 30, 2021, at 1:48 PM, Leon Timmermans <> wrote:
> On Fri, Jul 30, 2021 at 6:56 PM Felipe Gasper <> wrote:
>> FWIW, I think this will regress Perl’s usability.
>> Probably the worst part about character encoding in Perl is that nothing indicates when you’ve over-encoded or under-encoded. But, at the very least everything right now is consistent by default: source code is parsed as bytes (“Latin-1”), and I/O happens as bytes. Thus, a “minimal-effort” approach to writing Perl will at least minimize the odds of encoding mismatches: you only run into trouble if you explicitly decode/encode.
>> If `use v5.36` is to disrupt that consistency by making source code UTF-8-decoded but *leaving* I/O as bytes, this seems likely to add another “shin-bumper” to use of Perl that doesn’t happen in languages that type byte strings differently from text strings.
>> So quick-and-simple things like `print "é"` will now, in “modern” Perl, break, with no indication of where/why until a human being comes along, notices the problem, and puts in the time to debug it.
> It doesn't actually break. PerlIO will try to downgrade that for a non-:utf8 handle, or upgrade for a :utf8 handle.

It’ll downgrade it, but it won’t encode it, so you’ll get mojibake:

> perl -Mutf8 -e'print "é"'
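A minimal sketch of the same effect, for anyone who wants to see the bytes rather than the terminal rendering. It assumes the script file itself is saved as UTF-8; `bytes_printed` is a hypothetical helper name, and the in-memory filehandle only stands in for STDOUT with no encoding layer.

```perl
use strict;
use warnings;
use utf8;                       # decode string literals in this file as UTF-8
use Encode qw(encode);

# Hypothetical helper: print $string to a handle that has no encoding
# layer and return the bytes that were written, as lowercase hex.
sub bytes_printed {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>', \$buffer or die "open: $!";
    print {$fh} $string;
    close $fh or die "close: $!";
    return unpack 'H*', $buffer;
}

my $text = "é";                 # one character, U+00E9, because of `use utf8`

# Downgraded but not encoded: the single byte E9, which is not valid UTF-8.
print "unencoded: ", bytes_printed($text), "\n";                  # e9

# Explicitly encoded before printing: the UTF-8 sequence C3 A9.
print "encoded:   ", bytes_printed(encode('UTF-8', $text)), "\n"; # c3a9
```

Without `use utf8` the literal is already the two bytes C3 A9, so the unencoded print happens to come out right; that is the default consistency described above.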

>> It’s going to be particularly problematic with stuff like `mkdir "épée"` because now we’re *really* expecting the SvPV bug--where we give the raw PV to the kernel/OS--to stick around.
> That problem exists with or without this change. That said, I don't think I've ever seen a hard-coded non-ASCII path in a program, so I don't think this is much of an issue.

The problem exists, yes, but this change will make the bug that much more painful to fix.

I would wager that folks using Perl in the context of non-Latin languages (Cyrillic, CJK, &c.) will be more likely to hard-code non-ASCII paths. I personally mostly do it for testing. And, of course, the problem pertains not just to filesystem paths, but to any string we give to the kernel (e.g., args to exec()).
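As a rough illustration of what avoiding the raw-PV dependence looks like today, here is a sketch that encodes a hard-coded name explicitly before handing it to the kernel. It assumes the script is saved as UTF-8 and that the filesystem expects UTF-8 names; `mkdir_utf8` is a hypothetical helper, not an existing API.

```perl
use strict;
use warnings;
use utf8;                        # "épée" below is a 4-character string
use Encode qw(encode);
use File::Temp qw(tempdir);

# Hypothetical helper: create $name (a character string) inside $parent
# (a byte-string path) and return the byte-string path that was created.
# Assumption: the filesystem wants UTF-8 encoded names.
sub mkdir_utf8 {
    my ($parent, $name) = @_;
    # Encode only the character-string part; $parent is already bytes.
    my $path = $parent . '/' . encode('UTF-8', $name);
    mkdir $path or die "mkdir: $!";
    return $path;
}

my $parent = tempdir(CLEANUP => 1);
my $path   = mkdir_utf8($parent, "épée");
print "created: ", (-d $path ? "yes" : "no"), "\n";
```

The same explicit encode step would apply to any other string that crosses into the OS, such as arguments to exec().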
