develooper Front page | perl.perl5.porters | Postings from January 2022

Re: RFC: Rename the “UTF8” flag

Leon Timmermans
January 29, 2022 02:53
On Fri, Jan 28, 2022 at 4:26 PM Felipe Gasper wrote:

> ---------
> # Rename the “UTF8” Flag
> ## Preamble
> ```
> Author:  FELIPE
> Sponsor:
> ID:      ?
> Status:  Draft
> ```
> ## Abstract
> Perl’s “UTF8 flag” confuses Perl users, and even occasionally its
> maintainers. This RFC proposes to rename it in source code and
> documentation, retaining old names as aliases to avoid breaking callers
> of Perl’s C API (XS modules & embedders).
> This RFC proposes the replacement term “heavy”: `SvHEAVY`, etc.
> ## Motivation
> The “UTF8” moniker for this flag confuses people in at least three
> significant ways:
> - Many misconstrue the flag as indicating a “UTF-8 string”; in fact,
> when a Perl application encodes a string in UTF-8 the resulting scalar
> usually _lacks_ this flag.
> The inverse is also true: decoded/character strings typically
> _enable_ the flag. For those who work with Perl in C it makes some
> degree of sense to refer to “UTF8”-flagged strings as “UTF-8 strings”,
> but in a pure-Perl context this nomenclature encourages Perl users to
> consider Perl internals and effects a disparity in what two closely-related
> groups (Perl users and Perl maintainers) would logically call a
> “UTF-8 string”. The Perl community at large will benefit from there
> being exactly one meaning for the term “UTF-8 string”.
> - Some misconstrue the flag as indicating text vs. bytes--which, to be
> fair, Perl itself historically has done!
> - It’s technically wrong. The encoding that it signifies is not, in fact,
> UTF-8, but its “generalized” variant, which encodes any code point that
> the algorithm can tolerate. Proper UTF-8 forbids any code point above
> 0x10ffff, for example, while Perl will happily store code
> points up to 0x7fffffffffffffff (2^63 - 1).
> Encode’s documentation distinguishes `UTF-8` from `UTF8`,
> the latter indicating the generalized
> variant that Perl uses internally. This distinction is too subtle—and
> too Perl-specific—to help anyone who doesn’t already intimately know
> Perl and its character encoding “gotchas”.
> ## Rationale
> Renaming this flag will achieve several benefits:
> 1. The mistaken belief that Perl uses UTF-8 internally will recede.
> 2. The term “heavy” is more abstract than the well-known term “UTF-8”.
> That abstract quality will discourage Perl programmers from building
> application logic atop this part of Perl’s implementation.
> 3. The term “UTF-8 string” will be less sensible to use in reference
> to Perl’s internals since Perl will provide an official replacement
> (“heavy string”). This will help to prevent confusion
> when discussing encoding and related matters; having the terms
> “heavy UTF-8 string”, “heavy Unicode string”, “non-heavy UTF-8 string”,
> and “non-heavy Unicode string” will clarify matters where
> the current ambiguity of “UTF-8 string” impedes communication.
> ## Specification
> The following renames are proposed; in each case the old name
> should remain as an alias for the new (with appropriate indications
> in documentation):
> - `SVf_UTF8`        -> `SVf_HEAVY`
> - `SvUTF8`          -> `SvHEAVY`
> - `SvUTF8_on`       -> `SvHEAVY_on`
> - `SvUTF8_off`      -> `SvHEAVY_off`
> - `SvPOK_only_UTF8` -> `SvPOK_only_HEAVY`
> - `HeUTF8`          -> `HeHEAVY`
> - `HvNAMEUTF8`      -> `HvNAMEHEAVY`
> - `PadnameUTF8`     -> `PadnameHEAVY`
> - `sv_utf8_upgrade`             -> `sv_to_heavy`
> - `sv_utf8_upgrade_flags`       -> `sv_to_heavy_flags`
> - `sv_utf8_upgrade_flags_grow`  -> `sv_to_heavy_flags_grow`
> - `sv_utf8_upgrade_nomg`        -> `sv_to_heavy_nomg`
> - `sv_utf8_downgrade` -> `sv_from_heavy`
> The following changes are proposed:
> - `sv_dump` should output `HEAVY` rather than `UTF8`.
> Controls for input/output are **NOT** proposed for rename.
> This includes the likes of `COPHH_KEY_UTF8`, `SvPVutf8`, etc.
> Areas of uncertainty:

A change like this would touch a large amount of our code (including
more functions and macros than just the ones you mention), and it would
complicate our documentation, because suddenly all of these things have two
names. It would need a significant benefit to be worth all
that headache.

And I don't see that benefit. As far as I can tell, the argument is really
more about ideological purity than about any practical advantage. One of the two
internal encodings of Perl is (non-strict) UTF-8, and almost any code
dealing with this distinction will have to have knowledge of what encoding
is being used. Calling it "wide" or "heavy" might have made sense if we
could hide that, but that's exactly what we can't do.
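
To make that concrete, here is a minimal standalone C sketch. It does not use Perl's real API: the `toy_sv` struct, its `is_utf8` field, and `toy_char_count` are hypothetical stand-ins for a scalar's string buffer and its UTF8 flag. It only illustrates that the same bytes mean different things depending on the flag, so C code that reads the buffer has to know which encoding is in use.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a Perl scalar's string buffer.
   This is an illustration only, not Perl's real SV layout. */
typedef struct {
    const unsigned char *buf;  /* raw bytes of the string */
    size_t len;                /* length in bytes */
    int is_utf8;               /* stand-in for the UTF8 flag */
} toy_sv;

/* Count characters in the buffer.
   Flag off: every byte is one character.
   Flag on:  every byte that is not a UTF-8 continuation byte
             (10xxxxxx) starts a new character. */
static size_t toy_char_count(const toy_sv *sv) {
    assert(sv != NULL && sv->buf != NULL);
    if (!sv->is_utf8)
        return sv->len;

    size_t count = 0;
    for (size_t i = 0; i < sv->len; i++) {
        if ((sv->buf[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

Given the two bytes 0xC3 0xA9, `toy_char_count` reports two characters with the flag off (U+00C3 and U+00A9) and one character with the flag on (U+00E9). Whatever the flag is called, the C code still has to branch on the actual encoding.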

This sort of thing makes sense in Perl-land, because we very much do want
to hide internal encoding there. But in C-land we can't, and we shouldn't
pretend otherwise.

In summary, I believe the cost of this change is significantly higher than
the benefit, and we shouldn't pursue this.

