perl.perl5.porters

Re: tightening up source code encoding semantics

Ricardo Signes
June 18, 2022 01:59
On Sat, Feb 26, 2022, at 23:56, Karl Williamson wrote:
> [ things about how automatic detection could work ]

I will restate, tersely, what I think Karl said.  I hope Karl can then say "yes, that's right [or close enough]" or "no."
 * if the choices are Latin-1 or UTF-8, it is possible to predict with high confidence which one a given line of input is
 * we can use this to avoid having to declare the encoding
 * if encoding is declared, and is at odds with what is detected, a warning (or error) could be issued
So, first off: is that about right?
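To make the first bullet concrete, here is a minimal sketch of the kind of per-line guess I have in mind.  It is not Karl's proposal and not anything in the core; guess_line_encoding is a hypothetical helper, and the rule it applies (non-ASCII bytes that form well-formed UTF-8 are taken to be UTF-8, anything else is taken to be Latin-1) is only an assumption about how such detection might work.  It is a heuristic: a well-formed UTF-8 sequence is also a legal, if unlikely, Latin-1 string.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: guess how one line of raw source bytes is encoded.
# Returns 'ascii', 'utf8', or 'latin1'.
sub guess_line_encoding {
    my ($bytes) = @_;    # a copy, so decode() may consume it safely

    # No high-bit bytes: both encodings agree, nothing to decide.
    return 'ascii' unless $bytes =~ /[^\x00-\x7F]/;

    # Strict decoding croaks on malformed input; treat that as "not UTF-8".
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK); 1 };
    return $ok ? 'utf8' : 'latin1';
}

print guess_line_encoding("caf\xC3\xA9"), "\n";    # utf8
print guess_line_encoding("caf\xE9"), "\n";        # latin1
```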

Next:  I think this still requires that the program says "my source should be decoded at all".  I *do* agree with the assertion that we can "guess" whether input is UTF-8 or Latin-1, but that's not the only relevant question.  Imagine this program:
use v5.36;
my $str1 = "██████";
say $str1;

Right now, no matter what content is actually in that string literal, the same bytes that were in the source will be sent to stdout.  Imagine that we say "We can detect that the string is UTF-8 bytes, so we decode the bytes in the string literal so that $str1 contains the Unicode codepoints encoded in it."  When we print that string, we will get a "Wide character" warning, and we will deserve it.  This, more or less, is why this proposal ended up existing rather than the previous one to make "use vX" enable utf8.
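The difference can be shown without putting any non-ASCII in the source at all.  This is only an illustrative sketch: the snowman bytes are my own example, and decode() stands in for what a decoding parser would do to the literal.

```perl
use strict;
use warnings;
use feature 'say';
use Encode qw(decode encode);

# The three UTF-8 bytes of U+2603 SNOWMAN, as they would sit in a source file.
my $octets = "\xE2\x98\x83";

# What a decoding parser would store instead: a single codepoint.
my $chars = decode('UTF-8', $octets);

say length $octets;    # 3
say length $chars;     # 1

# Bytes in, bytes out: the octet string passes through a plain handle untouched.
print $octets, "\n";

# Saying $chars directly to a handle with no encoding layer is what triggers
# the "Wide character" warning; encoding explicitly gets the bytes back.
say encode('UTF-8', $chars);
```

The octet string round-trips silently today; the codepoint string only does so once something (an explicit encode, or an :encoding layer on the handle) turns it back into bytes.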

It was Felipe G., I believe, who said that users would end up more confused when the [lack of] automatic filehandle discipline didn't match the implicit source decoding.  I think that claim was correct.  I think we'd do users a disservice if we built strings by decoding the source literals based on encoding detection — not because the detection will be wrong, but because right now there is a bytes-in/bytes-out expectation.

Karl:  Please tell me if you think I am way off base, here.

I *do* think this all leads to a more exciting possibility, though!

We *could* automatically detect source encoding, but forbid non-ASCII in string literals without declaration.  This would allow non-ASCII syntax freely, but would require users to clarify that they know their literals will be decoded into codepoint strings rather than octet strings.  (If I wanted to keep banging the "adverbs on quote-like operators" drum, I would say that we could easily do this on a per-literal basis that way.)  I think the problem we're seeing here is the conflation of text and buffer types in Perl 5, and I feel like we're finding a nice way to smoosh the lump under the carpet into one place, but I don't think we can eliminate it just yet.
