
Re: "use v5.36.0" should imply UTF-8 encoded source Leon Timmermans<fawaka@gmail.com>

From: Salvador Fandiño
Date: August 3, 2021 09:01
Subject: Re: "use v5.36.0" should imply UTF-8 encoded source Leon Timmermans<fawaka@gmail.com>
Message ID: 99fef1e9-486d-9bfe-45b0-28895a4042bf@gmail.com

On 2/8/21 2:34, Felipe Gasper wrote:
> 
> 
>> On Aug 1, 2021, at 10:23 AM, Leon Timmermans <fawaka@gmail.com> wrote:
>>
>> Code is not binary, it is text. E.g.:
>>
>> use 5.010;
>> { no utf8;  say "éé" =~ /\N{LATIN SMALL LETTER E WITH ACUTE}/ ? "yes" : "no" };
>> { use utf8; say "éé" =~ /\N{LATIN SMALL LETTER E WITH ACUTE}/ ? "yes" : "no" };
>>
>> The status quo is only reasonable in that 95% of all code is actually ASCII, so it usually doesn't matter.
> 
> Code is indeed text, but this is not reasonable:
> 
>> perl -Mutf8 -e'print "é"'
> �
> 
> … particularly in contrast to this:
> 
>> echo é | perl -Mutf8 -e 'print <>'
> é
> 
> … and these:
> 
>> node -e 'console.log("é")'
> é
> 
>> python -c 'print("é")'
> é
> 
>> ruby -e 'puts "é"'
> é
> 
>> echo '<?php print "é" ?>' | php
> é
> 
>> echo | awk '{print "é"}'
> é
> 
>> julia -e'print("é")'
> é
> 
>> lua -e'print "é"'
> é
> 
> 
> For Unicode-aware applications it is indeed useful to auto-decode the strings, but is it really worth making Perl’s “modern default” the exceptionally weird behaviour of making:
> 
> perl -E'print "¡Hola, mundo!"'
> 
> … *not* print the given text correctly?
> 
> It just doesn’t seem a very workable “modern default”. How feasible, instead, would something like the following be:
> 
> ------
> 
> 1. Devote 2 bits of each SV to storing whether the PV is text or bytes:
> 
>      0 0 = unknown
>      0 1 = text
>      1 0 = bytes
>      1 1 = reserved/unused

IMO that's not the correct way to approach the problem here.

Perl already has PerlIO, whose layers transparently encode and decode 
data on some I/O interfaces (file handles, mostly). That support should 
be extended to cover all of them.
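
For reference, this is roughly what the existing PerlIO support looks like on a file handle. It is only a sketch: an in-memory handle stands in for a real file, and the variable names are made up for the example.

```perl
use strict;
use warnings;

my $text = "\x{e9}\x{e9}";    # "éé" as a two-character string

# Write through an encoding layer. The in-memory handle (\my $buffer)
# stands in for a real file so the example is self-contained.
open my $out, '>:encoding(UTF-8)', \my $buffer or die "open failed: $!";
print {$out} $text;
close $out;

# Read back through the same layer: the bytes are decoded transparently.
open my $in, '<:encoding(UTF-8)', \$buffer or die "open failed: $!";
my $round_trip = <$in>;
close $in;

printf "characters: %d, encoded bytes: %d\n", length($text), length($buffer);
```

The program only ever handles characters; the layer takes care of the bytes in both directions.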

Otherwise you are asking the programmer to do that translation 
explicitly every time some data goes through any builtin doing IO, as in:

mkdir do_encoding($dirname);

That doesn't make sense at all.
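
To make the burden concrete, here is a sketch of what the explicit approach means for a single directory operation today. It assumes a filesystem that stores names as UTF-8 bytes and does not normalize them (typical on Linux); `Encode` stands in for the `do_encoding` placeholder above, and the directory name is invented for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

my $base    = tempdir(CLEANUP => 1);
my $dirname = "caf\x{e9}";    # "café" as a character string (4 characters)

# Assumption: the filesystem expects UTF-8 encoded names.
# The encoding step has to be written out by hand for every such call.
my $path = "$base/" . encode('UTF-8', $dirname);
mkdir $path or die "mkdir failed: $!";

# Reading names back needs the inverse step, again by hand.
opendir my $dh, $base or die "opendir failed: $!";
my @names = map { decode('UTF-8', $_) } grep { !/^\.\.?$/ } readdir $dh;
closedir $dh;

print "created and read back ", scalar(@names), " entry\n";
```

Every other builtin that touches a file name (`open`, `unlink`, `rename`, `-e`, ...) would need the same treatment.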

What we need is to add proper support for the transparent translation of 
data between the internal representation and the outside encoding 
everywhere. And that means:

a) Adding this translation feature to every builtin that does I/O.
b) Adding a mechanism that lets the developer configure it (for 
instance, to set the filesystem encoding).
c) Inferring sane defaults from the environment (UTF-8 has been the 
default encoding on most Linux/Unix systems for the last two decades, 
but Perl still expects Latin-1 on STDIO!)

Regarding (c), that is also why most of your examples above work. 
Your terminal sends UTF-8 encoded data to perl and expects UTF-8 back, 
but perl reads and writes Latin-1. It just happens that in your examples 
perl is wrong in the same way on input and on output, so the two errors 
cancel out, but there are myriad other cases where they don't.
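
The effect can be reproduced without a terminal. In the sketch below an in-memory handle with no encoding layer plays the role of STDOUT; `raw_output` and `hex_bytes` are helpers made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $from_terminal = "\xC3\xA9";    # the two bytes a UTF-8 terminal sends for "é"

# Helper for this sketch: print to a handle that has no encoding layer,
# the way an unconfigured STDOUT behaves, and return the bytes written.
sub raw_output {
    my ($string) = @_;
    open my $fh, '>', \my $buffer or die "open failed: $!";
    print {$fh} $string;
    close $fh;
    return $buffer;
}

sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# Case 1: nothing is decoded and nothing is encoded. Perl treats the input
# as two Latin-1 characters and writes them back unchanged, so it looks right.
my $passthrough = raw_output($from_terminal);

# Case 2: the input is decoded properly, but the output side still has no
# layer, so the single character U+00E9 goes out as one Latin-1 byte.
my $mismatch = raw_output(decode('UTF-8', $from_terminal));

print "pass-through: ", hex_bytes($passthrough), "\n";
print "decoded only: ", hex_bytes($mismatch), "\n";
```

In the first case the terminal gets back valid UTF-8; in the second it gets a lone 0xE9 byte, which is what shows up as the replacement character in the `perl -Mutf8 -e'print "é"'` example.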

