Re: RFC: Rename the “UTF8” flag
January 29, 2022 03:42
Message ID: CANgJU+WGUXUteZOAdRWWPraJ2-xg77AQxiM=1ybAAQFWo+3hLQ@mail.gmail.com
On Sat, 29 Jan 2022, 10:54 Leon Timmermans, <email@example.com> wrote:
> On Fri, Jan 28, 2022 at 4:26 PM Felipe Gasper <firstname.lastname@example.org> wrote:
>> # Rename the “UTF8” Flag
>> ## Preamble
>> Author: FELIPE
>> ID: ?
>> Status: Draft
>> ## Abstract
>> Perl’s “UTF8 flag” confuses Perl users, and even occasionally its
>> maintainers. This RFC proposes to rename it in source code and
>> documentation, retaining old names as aliases to avoid breaking callers
>> of Perl’s C API (XS modules & embedders).
>> This RFC proposes the replacement term “heavy”: `SvHEAVY`, etc.
>> ## Motivation
>> The “UTF8” moniker for this flag confuses people in at least three
>> significant ways:
>> - Many misconstrue the flag as indicating a “UTF-8 string”; in fact,
>> when a Perl application encodes a string in UTF-8 the resulting scalar
>> usually _lacks_ this flag.
>> The inverse is also true: decoded/character strings typically
>> _enable_ the flag. For those who work with Perl in C it makes some
>> degree of sense to refer to “UTF8”-flagged strings as “UTF-8 strings”,
>> but in a pure-Perl context this nomenclature encourages Perl users to
>> Perl internals and effects a disparity in what two closely-related
>> groups (Perl users and Perl maintainers) would logically call a
>> “UTF-8 string”. The Perl community at large will benefit from there
>> being exactly one meaning for the term “UTF-8 string”.
>> - Some misconstrue the flag as indicating text vs. bytes--which, to be
>> fair, Perl itself historically has done!
>> - It’s technically wrong. The encoding that it signifies is not, in fact,
>> UTF-8, but its “generalized” variant, which encodes any code point that
>> the algorithm can tolerate. Proper UTF-8 forbids any code point above
>> 0x10ffff, for example, while Perl will happily store code
>> points up to 0x7fffffffffffffff (2^63 - 1).
>> (cf. https://simonsapin.github.io/wtf-8/#generalized-utf8)
>> Encode.pm’s documentation distinguishes `UTF-8` from `UTF8`, the latter
>> indicating the generalized variant that Perl uses internally. This
>> distinction is too subtle--and too Perl-specific--to help anyone who
>> doesn’t already intimately know Perl and its character encoding
>> “gotchas”.
>> ## Rationale
>> Renaming this flag will achieve several benefits:
>> 1. The mistaken belief that Perl uses UTF-8 internally will recede.
>> 2. The term “heavy” is more abstract than the well-known term “UTF-8”.
>> That abstract quality will discourage Perl programmers from building
>> application logic atop this part of Perl’s implementation.
>> 3. The term “UTF-8 string” will be less sensible to use in reference
>> to Perl’s internals since Perl will provide an official replacement
>> (“heavy string”). This will help to prevent confusion
>> when discussing encoding and related matters; having the terms
>> “heavy UTF-8 string”, “heavy Unicode string”, “non-heavy UTF-8 string”,
>> and “non-heavy Unicode string” will clarify matters where
>> the current ambiguity of “UTF-8 string” impedes communication.
>> ## Specification
>> The following renames are proposed; in each case the old name
>> should remain as an alias for the new (with appropriate indications
>> in documentation):
>> - `SVf_UTF8` -> `SVf_HEAVY`
>> - `SvUTF8` -> `SvHEAVY`
>> - `SvUTF8_on` -> `SvHEAVY_on`
>> - `SvUTF8_off` -> `SvHEAVY_off`
>> - `SvPOK_only_UTF8` -> `SvPOK_only_HEAVY`
>> - `HeUTF8` -> `HeHEAVY`
>> - `HvNAMEUTF8` -> `HvNAMEHEAVY`
>> - `HvENAMEUTF8` -> `HvENAMEHEAVY`
>> - `PadnameUTF8` -> `PadnameHEAVY`
>> - `sv_utf8_upgrade` -> `sv_to_heavy`
>> - `sv_utf8_upgrade_flags` -> `sv_to_heavy_flags`
>> - `sv_utf8_upgrade_flags_grow` -> `sv_to_heavy_flags_grow`
>> - `sv_utf8_upgrade_nomg` -> `sv_to_heavy_nomg`
>> - `sv_utf8_downgrade` -> `sv_from_heavy`
>> The following changes are proposed:
>> - `sv_dump` should output `HEAVY` rather than `UTF8`.
>> Controls for input/output are **NOT** proposed for rename.
>> This includes the likes of `COPHH_KEY_UTF8`, `SvPVutf8`, etc.
>> Areas of uncertainty:
> A change like this would involve a large amount of change in our code (and
> more functions/macros than just the ones you mention), as well as
> complicating our documentation (because suddenly all these things have two
> names). A change like that would need significant benefit to be worth all
> that headache.
> And I don't see that benefit. As far as I can tell, the argument is really
> more about ideological purity than any practical advantages. One of the two
> internal encodings of Perl is (non-strict) UTF-8, and almost any code
> dealing with this distinction will have to have knowledge of what encoding
> is being used. Calling it "wide" or "heavy" might have made sense if we
> could hide that, but that's exactly what we can't do.
> This sort of thing makes sense in Perl-land, because we very much do want
> to hide internal encoding there. But in C-land we can't, and we shouldn't
> pretend otherwise.
> In summary, I believe the cost of this change is significantly higher than
> the benefit, and we shouldn't pursue this.
Thank you, my objection is based on exactly the same rationale.
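
For anyone who wants to see the behaviour described in the quoted Motivation section, here is a minimal pure-Perl sketch. The sample string and variable names are only illustrative (they do not come from the RFC), and `utf8::is_utf8` is used purely to peek at the internal flag, not as something application code should branch on.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Illustrative sample text: "cafe" with e-acute, a space, and U+263A,
# written as characters (code points), not bytes.
my $chars = "caf\x{e9} \x{263A}";

# Encoding to UTF-8 produces a byte string; it lacks the flag.
my $bytes = encode('UTF-8', $chars);

# Decoding those bytes yields a character string with the flag set.
my $round_trip = decode('UTF-8', $bytes);

# A code point above 0x10FFFF is not valid Unicode, yet Perl stores it.
my $beyond = chr(0x110000);

# Report length and flag state for each string.
for my $pair ( [ chars => $chars ], [ bytes => $bytes ],
               [ round_trip => $round_trip ], [ beyond => $beyond ] ) {
    my ($name, $str) = @$pair;
    printf "%-10s length %d, flag %s\n",
        $name, length($str), utf8::is_utf8($str) ? 'on' : 'off';
}
```

The string that actually holds UTF-8 bytes reports the flag as off, while the decoded character strings report it as on, which is the mismatch the RFC points at.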