develooper Front page | perl.perl5.porters | Postings from January 2022

Re: RFC: Rename the “UTF8” flag

From: demerphq
Date: January 29, 2022 03:42
Subject: Re: RFC: Rename the “UTF8” flag
Message ID: CANgJU+WGUXUteZOAdRWWPraJ2-xg77AQxiM=1ybAAQFWo+3hLQ@mail.gmail.com
On Sat, 29 Jan 2022, 10:54 Leon Timmermans, <fawaka@gmail.com> wrote:

> On Fri, Jan 28, 2022 at 4:26 PM Felipe Gasper <felipe@felipegasper.com>
> wrote:
>
>>
>> https://github.com/FGasper/perl-rfcs/blob/rfc10_utf8_rename/rfcs/rfc0010.md
>>
>> ---------
>>
>> # Rename the “UTF8” Flag
>>
>> ## Preamble
>>
>> ```
>> Author:  FELIPE
>> Sponsor:
>> ID:      ?
>> Status:  Draft
>> ```
>>
>> ## Abstract
>>
>> Perl’s “UTF8 flag” confuses Perl users, and even occasionally its
>> maintainers. This RFC proposes to rename it in source code and
>> documentation, retaining old names as aliases to avoid breaking callers
>> of Perl’s C API (XS modules & embedders).
>>
>> This RFC proposes the replacement term “heavy”: `SvHEAVY`, etc.
>>
>> ## Motivation
>>
>> The “UTF8” moniker for this flag confuses people in at least three
>> significant ways:
>>
>> - Many misconstrue the flag as indicating a “UTF-8 string”; in fact,
>> when a Perl application encodes a string in UTF-8 the resulting scalar
>> usually _lacks_ this flag.
>> The inverse is also true: decoded/character strings typically
>> _enable_ the flag. For those who work with Perl in C it makes some
>> degree of sense to refer to “UTF8”-flagged strings as “UTF-8 strings”,
>> but in a pure-Perl context this nomenclature encourages Perl users to
>> consider Perl internals and effects a disparity in what two closely-related
>> groups (Perl users and Perl maintainers) would logically call a
>> “UTF-8 string”. The Perl community at large will benefit from there
>> being exactly one meaning for the term “UTF-8 string”.
>>
>> - Some misconstrue the flag as indicating text vs. bytes--which, to be
>> fair, Perl itself historically has done!
>>
>> - It’s technically wrong. The encoding that it signifies is not, in fact,
>> UTF-8, but its “generalized” variant, which encodes any code point that
>> the algorithm can tolerate. Proper UTF-8 forbids any code point above
>> 0x10ffff, for example, while Perl will happily store code
>> points up to 0x7fffffffffffffff (2^63 - 1).
>>
>> (cf. https://simonsapin.github.io/wtf-8/#generalized-utf8)
>>
>> Encode.pm’s documentation distinguishes
>> [`UTF-8` from `UTF8`](
>> https://metacpan.org/pod/Encode#UTF-8-vs.-utf8-vs.-UTF8),
>> the latter indicating the generalized
>> variant that Perl uses internally. This distinction is too subtle—and
>> too Perl-specific—to help anyone who doesn’t already intimately know
>> Perl and its character encoding “gotchas”.
>>
>> ## Rationale
>>
>> Renaming this flag will achieve several benefits:
>>
>> 1. The mistaken belief that Perl uses UTF-8 internally will recede.
>>
>> 2. The term “heavy” is more abstract than the well-known term “UTF-8”.
>> That abstract quality will discourage Perl programmers from building
>> application logic atop this part of Perl’s implementation.
>>
>> 3. The term “UTF-8 string” will be less sensible to use in reference
>> to Perl’s internals since Perl will provide an official replacement
>> (“heavy string”). This will help to prevent confusion
>> when discussing encoding and related matters; having the terms
>> “heavy UTF-8 string”, “heavy Unicode string”, “non-heavy UTF-8 string”,
>> and “non-heavy Unicode string” will clarify matters where
>> the current ambiguity of “UTF-8 string” impedes communication.
>>
>> ## Specification
>>
>> The following renames are proposed; in each case the old name
>> should remain as an alias for the new (with appropriate indications
>> in documentation):
>>
>> - `SVf_UTF8`        -> `SVf_HEAVY`
>> - `SvUTF8`          -> `SvHEAVY`
>> - `SvUTF8_on`       -> `SvHEAVY_on`
>> - `SvUTF8_off`      -> `SvHEAVY_off`
>> - `SvPOK_only_UTF8` -> `SvPOK_only_HEAVY`
>> - `HeUTF8`          -> `HeHEAVY`
>> - `HvNAMEUTF8`      -> `HvNAMEHEAVY`
>> - `HvENAMEUTF8`     -> `HvENAMEHEAVY`
>> - `PadnameUTF8`     -> `PadnameHEAVY`
>> - `sv_utf8_upgrade`             -> `sv_to_heavy`
>> - `sv_utf8_upgrade_flags`       -> `sv_to_heavy_flags`
>> - `sv_utf8_upgrade_flags_grow`  -> `sv_to_heavy_flags_grow`
>> - `sv_utf8_upgrade_nomg`        -> `sv_to_heavy_nomg`
>> - `sv_utf8_downgrade` -> `sv_from_heavy`
>>
>> The following changes are proposed:
>>
>> - `sv_dump` should output `HEAVY` rather than `UTF8`.
>>
>> Controls for input/output are **NOT** proposed for rename.
>> This includes the likes of `COPHH_KEY_UTF8`, `SvPVutf8`, etc.
>>
>> Areas of uncertainty:
>>
>
> A change like this would involve a large amount of change in our code (and
> more functions/macros than just the ones you mention), as well as
> complicating our documentation (because suddenly all these things have two
> names). A change like that would need significant benefit to be worth all
> that headache.
>
> And I don't see that benefit. As far as I can tell, the argument is really
> more about ideological purity than any practical advantages. One of the two
> internal encodings of Perl is (non-strict) UTF-8, and almost any code
> dealing with this distinction will have to have knowledge of what encoding
> is being used. Calling it "wide" or "heavy" might have made sense if we
> could hide that, but that's exactly what we can't do.
>
> This sort of thing makes sense in Perl-land, because we very much do want
> to hide internal encoding there. But in C-land we can't, and we shouldn't
> pretend otherwise.
>
> In summary, I believe the cost of this change is significantly higher than
> the benefit, and we shouldn't pursue this.
>

Thank you, my objection is based on exactly the same rationale.

Yves

