Re: RFC: Rename the “UTF8” flag
From:
Leon Timmermans
Date:
January 29, 2022 02:53
Subject:
Re: RFC: Rename the “UTF8” flag
Message ID:
CAHhgV8hpNu4pJxSALWPOEKu1Q4R8Hg1L=KX8J0p0U+T1ZaNE+Q@mail.gmail.com
On Fri, Jan 28, 2022 at 4:26 PM Felipe Gasper <felipe@felipegasper.com>
wrote:
> https://github.com/FGasper/perl-rfcs/blob/rfc10_utf8_rename/rfcs/rfc0010.md
>
> ---------
>
> # Rename the “UTF8” Flag
>
> ## Preamble
>
> ```
> Author: FELIPE
> Sponsor:
> ID: ?
> Status: Draft
> ```
>
> ## Abstract
>
> Perl’s “UTF8 flag” confuses Perl users, and even occasionally its
> maintainers. This RFC proposes to rename it in source code and
> documentation, retaining old names as aliases to avoid breaking callers
> of Perl’s C API (XS modules & embedders).
>
> This RFC proposes the replacement term “heavy”: `SvHEAVY`, etc.
>
> ## Motivation
>
> The “UTF8” moniker for this flag confuses people in at least three
> significant ways:
>
> - Many misconstrue the flag as indicating a “UTF-8 string”; in fact,
> when a Perl application encodes a string in UTF-8 the resulting scalar
> usually _lacks_ this flag.
> The inverse is also true: decoded/character strings typically
> _enable_ the flag. For those who work with Perl in C it makes some
> degree of sense to refer to “UTF8”-flagged strings as “UTF-8 strings”,
> but in a pure-Perl context this nomenclature encourages Perl users to
> consider Perl internals and effects a disparity in what two closely-related
> groups (Perl users and Perl maintainers) would logically call a
> “UTF-8 string”. The Perl community at large will benefit from there
> being exactly one meaning for the term “UTF-8 string”.
>
> - Some misconstrue the flag as indicating text vs. bytes--which, to be
> fair, Perl itself historically has done!
>
> - It’s technically wrong. The encoding that it signifies is not, in fact,
> UTF-8, but its “generalized” variant, which encodes any code point that
> the algorithm can tolerate. Proper UTF-8 forbids any code point above
> 0x10ffff, for example, while Perl will happily store code
> points up to 0x7fffffffffffffff (2^63 - 1).
>
> (cf. https://simonsapin.github.io/wtf-8/#generalized-utf8)
>
> Encode.pm’s documentation distinguishes
> [`UTF-8` from `UTF8`](
> https://metacpan.org/pod/Encode#UTF-8-vs.-utf8-vs.-UTF8),
> the latter indicating the generalized
> variant that Perl uses internally. This distinction is too subtle--and
> too Perl-specific--to help anyone who doesn’t already intimately know
> Perl and its character encoding “gotchas”.
>
> ## Rationale
>
> Renaming this flag will achieve several benefits:
>
> 1. The mistaken belief that Perl uses UTF-8 internally will recede.
>
> 2. The term “heavy” is more abstract than the well-known term “UTF-8”.
> That abstract quality will discourage Perl programmers from building
> application logic atop this part of Perl’s implementation.
>
> 3. The term “UTF-8 string” will be less sensible to use in reference
> to Perl’s internals since Perl will provide an official replacement
> (“heavy string”). This will help to prevent confusion
> when discussing encoding and related matters; having the terms
> “heavy UTF-8 string”, “heavy Unicode string”, “non-heavy UTF-8 string”,
> and “non-heavy Unicode string” will clarify matters where
> the current ambiguity of “UTF-8 string” impedes communication.
>
> ## Specification
>
> The following renames are proposed; in each case the old name
> should remain as an alias for the new (with appropriate indications
> in documentation):
>
> - `SVf_UTF8` -> `SVf_HEAVY`
> - `SvUTF8` -> `SvHEAVY`
> - `SvUTF8_on` -> `SvHEAVY_on`
> - `SvUTF8_off` -> `SvHEAVY_off`
> - `SvPOK_only_UTF8` -> `SvPOK_only_HEAVY`
> - `HeUTF8` -> `HeHEAVY`
> - `HvNAMEUTF8` -> `HvNAMEHEAVY`
> - `HvENAMEUTF8` -> `HvENAMEHEAVY`
> - `PadnameUTF8` -> `PadnameHEAVY`
> - `sv_utf8_upgrade` -> `sv_to_heavy`
> - `sv_utf8_upgrade_flags` -> `sv_to_heavy_flags`
> - `sv_utf8_upgrade_flags_grow` -> `sv_to_heavy_flags_grow`
> - `sv_utf8_upgrade_nomg` -> `sv_to_heavy_nomg`
> - `sv_utf8_downgrade` -> `sv_from_heavy`
>
> The following changes are proposed:
>
> - `sv_dump` should output `HEAVY` rather than `UTF8`.
>
> Controls for input/output are **NOT** proposed for rename.
> This includes the likes of `COPHH_KEY_UTF8`, `SvPVutf8`, etc.
>
> Areas of uncertainty:
>
A change like this would involve a large amount of change in our code (and
more functions/macros than just the ones you mention), as well as
complicating our documentation (because suddenly all these things have two
names). A change like that would need significant benefit to be worth all
that headache.

And I don't see that benefit. As far as I can tell, the argument is really
more about ideological purity than any practical advantages. One of the two
internal encodings of Perl is (non-strict) UTF-8, and almost any code
dealing with this distinction will have to have knowledge of what encoding
is being used. Calling it "wide" or "heavy" might have made sense if we
could hide that, but that's exactly what we can't do.

This sort of thing makes sense in Perl-land, because we very much do want
to hide internal encoding there. But in C-land we can't, and we shouldn't
pretend otherwise.
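
A minimal sketch of why C code can't ignore the encoding. The `toy_sv` struct and `toy_char_count` function are hypothetical stand-ins for illustration, not Perl's real `SV` or any function in its API; the sketch only assumes a byte buffer plus a flag saying which of the two internal encodings the buffer uses:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a scalar's string buffer; NOT Perl's real SV. */
typedef struct {
    const unsigned char *buf; /* raw bytes as stored */
    size_t len;               /* length in bytes */
    bool utf8;                /* toy analogue of the UTF8 flag */
} toy_sv;

/* Count characters. The same bytes give a different answer depending on
 * the flag, so the caller cannot treat the flag as an opaque detail. */
static size_t toy_char_count(const toy_sv *sv)
{
    if (!sv->utf8)
        return sv->len; /* one byte per character */

    size_t n = 0;
    for (size_t i = 0; i < sv->len; i++) {
        /* Count each byte that is not a 10xxxxxx continuation byte. */
        if ((sv->buf[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

With the flag off, the single byte `0xE9` is one character; with the flag on, the two bytes `0xC3 0xA9` are also one character (the same U+00E9). Read the two-byte buffer with the flag off and you get two characters, which is exactly the kind of mistake code makes when it doesn't know which encoding it is looking at.
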
In summary, I believe the cost of this change is significantly higher than
the benefit, and we shouldn't pursue this.

Leon