Re: Pre-RFC: Rename SVf_UTF8 et al.

From: Felipe Gasper
Date: August 30, 2021 14:22
Subject: Re: Pre-RFC: Rename SVf_UTF8 et al.
Message ID: E5298993-3E67-492E-AD4C-84BF8152139A@felipegasper.com

> On Aug 30, 2021, at 8:18 AM, Dave Mitchell <davem@iabyn.com> wrote:
> 
> Date: Mon, 30 Aug 2021 13:17:04 +0100
> From: Dave Mitchell <davem@iabyn.com>
> To: Felipe Gasper <felipe@felipegasper.com>
> Subject: Re: Pre-RFC: Rename SVf_UTF8 et al.
> Message-ID: <YSzMQJIeURS/AznY@iabyn.com>
> 
> On Wed, Aug 18, 2021 at 01:18:34PM -0400, Felipe Gasper wrote:
>> PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of confusion regarding the flag’s significance.
> 
> 
> The SVf_UTF8 flag has a clear and unambiguous meaning (apart from some
> historical bugs): in what manner the codepoints of a string are stored as
> a sequence of bytes in memory.
> 
> If people are confused by this, renaming it is only going to add to the
> cognitive load and confusion.

I’ve proposed some fixes for perlre.pod (https://github.com/Perl/perl5/pull/19087). They correct documentation bugs that crept in specifically because “UTF-8” was used to refer to “upgraded” strings. That usage confuses even Perl’s own maintainers.

“UTF-8 string” can mean two quite different things: a string that holds UTF-8-encoded bytes, or a string whose internal storage is upgraded. That ambiguity causes lots of encoding bugs in the wild, and the fact that Perl *can’t* help to fix them worsens the problem.

Ricardo sensed a problem here back in 2016: https://www.youtube.com/watch?v=TmTeXcEixEg&t=940s

… when he referred to the flag as WIDE, in part because the encoding in question is *not*, in fact, UTF-8. Then he said: “Some joker went ahead, and they called that the UTF-8 flag.” Chuckles ensued.

Benefits of changing the internal terminology:

- It clarifies “external”, Perl-visible encoding versus internal codepoint storage. Different terms for different things.
- More abstract terminology for the internals discourages folks from peeking behind the abstraction.
- It’s more correct. Proper UTF-8 forbids quite a lot that Perl’s “lax UTF-8” (by design) allows.
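
To illustrate that last point, here is a minimal sketch in Python, chosen only because its built-in codec is a strict UTF-8 implementation. It is not Perl's code, and the helper names `strict_utf8_accepts` and `strict_utf8_decodes` are made up for this example. It shows two things that strict UTF-8 rejects: surrogate code points and anything above U+10FFFF. As I understand it, Perl's internal "lax" form can store both.

```python
def strict_utf8_accepts(code_point: int) -> bool:
    """Return True if a strict UTF-8 encoder accepts this code point.

    Hypothetical helper for illustration only.
    """
    try:
        # chr() raises ValueError above U+10FFFF; the strict codec
        # raises UnicodeEncodeError for surrogates (U+D800..U+DFFF).
        chr(code_point).encode("utf-8")
        return True
    except (UnicodeEncodeError, ValueError):
        return False


def strict_utf8_decodes(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts these bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The "surrogatepass" error handler deliberately relaxes the codec, which
# gives a UTF-8-shaped byte sequence for a lone surrogate.
lax_surrogate_bytes = chr(0xD800).encode("utf-8", "surrogatepass")

print(strict_utf8_accepts(0xE9))        # ordinary code point
print(strict_utf8_accepts(0xD800))      # surrogate
print(strict_utf8_accepts(0x110000))    # beyond the Unicode range
print(strict_utf8_decodes(lax_surrogate_bytes))
```

Bytes that merely look like UTF-8 are therefore not necessarily UTF-8, which is one more reason to give the internal storage format its own name.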

Thanks for reading.

-FG