
Re: "use v5.36.0" should imply UTF-8 encoded source

From: Darren Duncan
Date: July 31, 2021 21:26
Subject: Re: "use v5.36.0" should imply UTF-8 encoded source
Message ID: a0054493-e6dd-e809-134c-3eb2b9642569@darrenduncan.net
Thank you, Felipe; your latest comment here is something I can much more easily 
get behind.

And in the wider case I am actually very much of the same mind as what you 
expressed.

I actually consider matters of source text encoding to be very distinct and very 
separate from all other matters of syntax.

From a purist perspective, I believe it is best for any declaration of source 
encoding to remain explicit and permanently separate from "use vN;" and the like.

The only reason I generally supported rolling a UTF-8 declaration into "use vN;" 
was to make it easier for Perl users to avoid extra boilerplate in a common case.
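
To be concrete, the boilerplate in question is just the explicit pragma pair 
below (a minimal sketch; whether "use vN;" should imply the utf8 line is exactly 
what this thread is debating):

    # today: the source encoding has to be declared separately
    use v5.36;
    use utf8;    # this file's own bytes are UTF-8 encoded

    # under the proposal, "use v5.36;" alone would imply "use utf8;"
    my $greeting = "héllo";    # non-ASCII literal in the source text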

On further thought, I will downgrade my level of support for the proposal to 
neutral.

I have been thinking about these matters a lot for years in the context of my 
own independent language/format, such as 
https://github.com/muldis/Muldis_Object_Notation/blob/master/spec/Muldis_Object_Notation_Syntax_Plain_Text.md 
which is a work in progress.

Since I've been able to design something green-field, I have further generalized 
something Perl and Raku have but many languages don't: the program source code 
itself explicitly declares what it is, so it can be interpreted as reliably as 
possible in the way the writer intended, rather than relying on external context.

In particular, the source code supports three very distinct explicit 
declarations (a rough sketch of the first one follows the list):

1. "script" - What the character encoding of the source is.  This is intended to 
disambiguate when there is no 100% reliable heuristic to determine it from 
analyzing the byte stream itself.  Parsers are expected to support UTF-8 (and 
hence also ASCII) at an absolute minimum, and others optionally.  Also parsers 
are in the general case always intended to take their input as octet strings and 
tokenizers would take and return octets rather than characters.

2. "syntax" - What concrete syntax or grammar or format applies to the file.  In 
the sense that say JSON or XML or YAML or SQL or whatever are syntaxes.

3. "model" - What data model applies to the file, loosely what data model type 
each literal etc maps to.  For example do Integer and Fraction literals map to 
distinct types or to the same type.
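
As promised above, here is a rough Perl-flavored sketch of item 1.  The 
"%% script ..." header line is purely hypothetical (it is not actual Muldis 
Object Notation syntax); the point is only that the parser takes octets and 
decodes them after reading an explicit declaration:

    use strict;
    use warnings;
    use Encode qw(decode FB_CROAK);

    # Read a source file as raw octets, look for an explicit "script"
    # (character encoding) declaration in the ASCII-compatible leading
    # bytes, then decode the whole thing accordingly; UTF-8 is the
    # required default when nothing is declared.
    sub read_declared_source {
        my ($path) = @_;
        open my $fh, '<:raw', $path or die "open $path: $!";
        my $octets = do { local $/; <$fh> };

        my $encoding = 'UTF-8';
        if ($octets =~ /\A%%\s*script\s+(\S+)/) {    # hypothetical syntax
            $encoding = $1;
        }
        return decode($encoding, $octets, FB_CROAK);
    }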

Now, this is designed around static syntaxes, where one can completely and 
unambiguously parse a source code string without any knowledge of user-defined 
operators or the like.  That is in contrast to Perl and Raku, where the parser 
itself changes how it interprets things as it goes along based on higher-level 
user-defined things; my language/format intentionally doesn't do that.

Given that Perl is quite different and has its legacy, what I'm saying above has 
very limited applicability to the current Perl discussion; however, I feel the 
Perl community can still learn lessons from it.

-- Darren Duncan

On 2021-07-31 1:15 p.m., Felipe Gasper wrote:
> Turning on warnings in the feature bundle will break things that worked under prior feature bundles, but the breakage will be visible and obvious.
> 
> Adding an auto-UTF-8-decode to all source text is a much more subtle breakage, and thus much more prone to confuse people. It’s basically the same type of change as making “my $foo = 123” parse the “123” in hex rather than decimal.
> 
> The proposal here is basically for “modern Perl” to make strings in the source code unable to be output as they are (integrally, that is). It seems *awfully* likely to confuse people. Even that aside, in, e.g., JavaScript or Python the interpreter could at least tell you, “hey, you’re trying to print a character string, and I don’t know what encoding you want.” Or, “whoa, that’s a byte string, and this output stream encodes to UTF-8.” Perl has no way of doing that.
> 
> Perl’s status quo is that all inputs are byte strings, and all outputs are byte strings. This is simple and consistent: until an application willingly interacts with something that needs or gives text strings (e.g., JSON), everything works similarly to “classic” C strings.
> 
> When we start worrying about “text”, though, confusion abounds: Perl can’t tell you when you’ve got the wrong “type”, and the language itself doesn’t even implement its own internal abstraction consistently (see CPAN’s Sys::Binmode). And how many interfaces out there neglect to document whether they expect/give encoded/decoded strings? Making “modern Perl” aggravate that further by defaulting to disparate encoding levels--inputs from the source will need encoding to be printed, but inputs from STDIN won’t … ?!?--will add even more “landmines”.
> 
> Decoding source code as UTF-8 makes tons of sense, but only *after* the critical first step is taken of teaching Perl to distinguish text from bytes. (I have ideas for how to achieve that, if there are folks here interested in discussing it further.) That way we can change the “modern” default, and, as with warnings, breakages will come with useful error messages that point to where the problem is and how to fix it.
> 
> As a side note, this will facilitate other, hugely useful improvements like making it practical to use Windows’s Unicode APIs, preventing double-encode/decode, etc.
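
For what it's worth, Felipe's point about source literals becoming "unable to be 
output as they are" can be seen with a tiny example (my own illustration, not 
from his message):

    use v5.32;
    use utf8;                      # source text is decoded as UTF-8
    my $s = "snowman: ☃";          # a character string, not octets

    print "$s\n";                  # warns "Wide character in print";
                                   # Perl can't know the wanted encoding

    binmode STDOUT, ':encoding(UTF-8)';   # explicit output layer needed
    print "$s\n";                  # now re-encoded to UTF-8 on output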



