Re: "use v5.36.0" should imply UTF-8 encoded source

From: Karl Williamson
Date: August 7, 2021 17:35
Subject: Re: "use v5.36.0" should imply UTF-8 encoded source
Message ID: 8489dcc2-87d5-dba2-00aa-8fecf162bc40@khwilliamson.com

On 8/6/21 5:34 PM, Aaron Priven wrote:
>> On Aug 4, 2021, at 4:33 PM, Dan Book <grinnz@gmail.com> wrote:
>> They are "text by default" in the ASCII sense, not the Unicode sense. 
>> The :crlf layer is enabled by default on Windows and translates CR LF 
>> to LF, but there is no default translation of bytes to characters. So 
>> you need to use binmode or :raw to make a filehandle binary-compatible 
>> on Windows, but you also need to apply an :encoding layer if you want 
>> to read/write characters instead of bytes.
> 
> I don’t think it’s true that there’s no default treatment of bytes as 
> characters. By default, perl treats bytes as Latin-1. So if you open a 
> file without an encoding layer, read some data, and then output it to a 
> file opened with an encoding layer, that encoding layer will assume that 
> the data being output is in Latin-1, and convert that to characters 
> accordingly.

Perl doesn't treat bytes as Latin-1 by default.  It treats
non-ASCII-range bytes as not being in any character set.  All such bytes
match \W in patterns, for example, and uc and similar functions return
them unchanged.  Getting Latin-1 treatment requires either the
unicode_strings feature or converting the string to UTF-8.
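
A minimal sketch of that difference, assuming an ASCII platform and no
"use locale" in scope.  The variable names are only illustrative; the
byte 0xE9 is "é" in Latin-1.  The results are printed as flags and hex
so that the output itself stays ASCII.

```perl
use strict;
use warnings;

# A single non-ASCII byte in a native (non-UTF-8) string.
my $byte = "\xE9";

# Default rules: the byte belongs to no character set.
my $is_word_plain = ($byte =~ /\w/) ? 1 : 0;
my $uc_plain      = uc $byte;

# Same string, but with the unicode_strings feature in scope.
my ($is_word_feat, $uc_feat);
{
    use feature 'unicode_strings';
    $is_word_feat = ($byte =~ /\w/) ? 1 : 0;
    $uc_feat      = uc $byte;
}

# Same code point, stored as UTF-8 internally, feature not in scope.
my $upgraded = $byte;
utf8::upgrade($upgraded);
my $is_word_upgraded = ($upgraded =~ /\w/) ? 1 : 0;
my $uc_upgraded      = uc $upgraded;

printf "plain:    \\w=%d uc=%02X\n", $is_word_plain,    ord $uc_plain;
printf "feature:  \\w=%d uc=%02X\n", $is_word_feat,     ord $uc_feat;
printf "upgraded: \\w=%d uc=%02X\n", $is_word_upgraded, ord $uc_upgraded;
```

Only the first case leaves the byte outside any character set; in the
other two it is handled as U+00E9, so it matches \w and uppercases to
0xC9.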

> 
> So an open filehandle is, in perl, a text filehandle using encoding 
> Latin-1, unless a layer or binmode is used.  It wouldn’t be unreasonable 
> to decide that, in some future version of perl, an open filehandle would 
> be treated as a text filehandle using encoding UTF-8 instead.
> 
> The problem, of course, is that on some but not all operating systems 
> the text filehandle returned by open can be used as a binary filehandle 
> without loss. /Conceptually/ it’s a text filehandle, the meaning in the 
> perl language is that it’s a text filehandle, but people misuse it as a 
> binary one because there’s no actual breakage, as long as it’s only 
> running on those operating systems.
> 
> So I understand that in practice, it might well be more trouble than 
> it’s worth to have “use 5.036” or even “use v7” make perl default to 
> text filehandles using the UTF-8 encoding, instead of defaulting to text 
> filehandles using the Latin-1 encoding. But I think it’s worth considering.
> 
> -- 
> Aaron Priven, aaron@priven.com, www.priven.com/aaron

