
Re: "use v5.36.0" should imply UTF-8 encoded source

From: Felipe Gasper
Date: July 30, 2021 18:46
Subject: Re: "use v5.36.0" should imply UTF-8 encoded source
Message ID: 641D6E0F-3904-434B-AF54-AD5D0232E4E0@felipegasper.com


> On Jul 30, 2021, at 2:27 PM, Leon Timmermans <fawaka@gmail.com> wrote:
> 
> On Fri, Jul 30, 2021 at 7:56 PM Felipe Gasper <felipe@felipegasper.com> wrote:
> 
> 
> > On Jul 30, 2021, at 1:48 PM, Leon Timmermans <fawaka@gmail.com> wrote:
> > 
> > On Fri, Jul 30, 2021 at 6:56 PM Felipe Gasper <felipe@felipegasper.com> wrote:
> > FWIW, I think this will regress Perl’s usability.
> > 
> > Probably the worst part about character encoding in Perl is that nothing indicates when you’ve over-encoded or under-encoded. But, at the very least everything right now is consistent by default: source code is parsed as bytes (“Latin-1”), and I/O happens as bytes. Thus, a “minimal-effort” approach to writing Perl will at least minimize the odds of encoding mismatches: you only run into trouble if you explicitly decode/encode.
> > 
> > If `use v5.36` is to disrupt that consistency by making source code UTF-8-decoded but *leaving* I/O as bytes, this seems likely to add another “shin-bumper” to use of Perl that doesn’t happen in languages that type byte strings differently from text strings.
> > 
> > So quick-and-simple things like `print "é"` will now, in “modern” Perl, break, with no indication of where/why until a human being comes along, notices the problem, and puts in the time to debug it.
> > 
> > It doesn't actually break. PerlIO will try to downgrade that for a non-:utf8 handle, or upgrade for a :utf8 handle.
> 
> It’ll downgrade it, but it won’t encode it, so you’ll get mojibake:
> 
> > perl -Mutf8 -e'print "é"'
> �
> 
> It will print mojibake as well if the script is latin-1 encoded. It's mojibake because the terminal is utf-8, but the IO handle is latin1.

FWIW I think it’s easier to think of the default I/O mode as “bytes” or “native 8-bit encoding” rather than “Latin-1”. In that light it’s easier to see the status quo as the more reasonable default: we parse the code as bytes, and we print as bytes.
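To make the bytes-in/bytes-out point concrete, here is a minimal sketch. It assumes the source file is saved as UTF-8, so that "é" is the two bytes C3 A9 on disk. The helper names (bytes_written, hex_dump) are made up for illustration, and decode() stands in for what `use utf8` does to a string literal:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The two bytes a UTF-8-encoded source file holds for the literal "é".
my $source_bytes = "\xC3\xA9";

# Status quo: the literal is parsed as bytes, so the string is those 2 bytes.
my $as_bytes = $source_bytes;

# Under `use utf8` the literal is decoded to one character, U+00E9.
# (decode() here is an illustrative stand-in for the pragma.)
my $as_chars = decode('UTF-8', "$source_bytes");

# Hypothetical helper: capture what a handle with no encoding layer writes.
sub bytes_written {
    my ($string) = @_;
    my $buffer = '';
    open(my $fh, '>', \$buffer) or die "open: $!";
    print {$fh} $string;
    close($fh) or die "close: $!";
    return $buffer;
}

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_dump {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $octets;
}

my $out_bytes = bytes_written($as_bytes);                    # bytes in, bytes out
my $out_chars = bytes_written($as_chars);                    # decoded in, bytes out
my $out_fixed = bytes_written(encode('UTF-8', $as_chars));   # decoded in, encoded out

printf "%-32s %s\n", 'bytes in, bytes out:',     hex_dump($out_bytes);
printf "%-32s %s\n", 'decoded in, bytes out:',   hex_dump($out_chars);
printf "%-32s %s\n", 'decoded in, encoded out:', hex_dump($out_fixed);
```

The first and third lines should show C3 A9, which a UTF-8 terminal renders as "é". The second should show a lone E9, which is not valid UTF-8 and is what shows up as the replacement character in the one-liner quoted above.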

Changing it so that the (“modern”) default is to decode strings as UTF-8 but still output them as bytes seems likely to introduce lots of confusion, which will either a) discourage adoption of “use v5.36”, or b) discourage use of Perl at all:

Anti-Perler: Hey that new Perl script you wrote mangles our CEO’s name.
Perler: That’s weird … I used the modern defaults … wonder where the bug is …
Anti-Perler: Maybe you should just switch to $otherlang, where this stuff doesn’t happen.

-F