
RFC: Rename the “UTF8” flag

From: Felipe Gasper
Date: January 28, 2022 15:26
Subject: RFC: Rename the “UTF8” flag
Message ID: 29E8C973-35C4-40DF-A445-EB487DE9B182@felipegasper.com

https://github.com/FGasper/perl-rfcs/blob/rfc10_utf8_rename/rfcs/rfc0010.md

---------

# Rename the “UTF8” Flag

## Preamble

```
Author:  FELIPE
Sponsor:
ID:      ?
Status:  Draft
```

## Abstract

Perl’s “UTF8 flag” confuses Perl users, and even occasionally its
maintainers. This RFC proposes to rename it in source code and
documentation, retaining old names as aliases to avoid breaking callers
of Perl’s C API (XS modules & embedders).

This RFC proposes the replacement term “heavy”: `SvHEAVY`, etc.

## Motivation

The “UTF8” moniker for this flag confuses people in at least three
significant ways:

- Many misconstrue the flag as indicating a “UTF-8 string”; in fact,
when a Perl application encodes a string in UTF-8 the resulting scalar
usually _lacks_ this flag.
The converse also holds: decoded/character strings typically have the
flag _enabled_. For those who work with Perl in C it makes some
degree of sense to refer to “UTF8”-flagged strings as “UTF-8 strings”,
but in a pure-Perl context this nomenclature encourages Perl users to consider
Perl internals and effects a disparity in what two closely-related
groups (Perl users and Perl maintainers) would logically call a
“UTF-8 string”. The Perl community at large will benefit from there
being exactly one meaning for the term “UTF-8 string”.

- Some misconstrue the flag as indicating text vs. bytes--which, to be
fair, Perl itself historically has done!

- It’s technically wrong. The encoding that it signifies is not, in fact,
UTF-8, but its “generalized” variant, which encodes any code point that
the algorithm can tolerate. Proper UTF-8 forbids any code point above
0x10ffff, for example, while Perl (on builds with 64-bit integers) will
happily store code points up to 0x7fffffffffffffff (2^63 - 1).

(cf. https://simonsapin.github.io/wtf-8/#generalized-utf8)
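
The first two points can be observed from pure Perl. The following is a minimal sketch using the core `Encode` module and `utf8::is_utf8()`; the sample string and variable names are illustrative, and the flag states noted in the comments describe typical behavior rather than a guarantee.

```perl
use strict;
use warnings;
use Encode ();

# One character, U+00E9. The \x{e9} escape produces a string that is
# stored as a single byte, with the flag off.
my $chars = "\x{e9}";

# Encoding to UTF-8 yields two octets (0xC3 0xA9), stored with the flag off.
my $encoded = Encode::encode('UTF-8', $chars);

# Decoding those octets yields the one-character string again; Encode
# normally returns it with the flag on.
my $decoded = Encode::decode('UTF-8', $encoded);

# Upgrading changes only the internal storage. The upgraded copy still
# compares equal to the original, so the flag does not mark "text".
my $upgraded = $chars;
utf8::upgrade($upgraded);

for my $pair (
    [ 'chars',    $chars    ],
    [ 'encoded',  $encoded  ],
    [ 'decoded',  $decoded  ],
    [ 'upgraded', $upgraded ],
) {
    my ($label, $value) = @$pair;
    printf "%-8s length=%d flag=%s\n",
        $label, length($value), utf8::is_utf8($value) ? 'on' : 'off';
}

print "upgraded eq chars: ", ($upgraded eq $chars ? 'yes' : 'no'), "\n";
```

The string that actually holds UTF-8 octets is the one without the flag, which is the opposite of what the flag’s name suggests.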

Encode.pm’s documentation distinguishes
[`UTF-8` from `UTF8`](https://metacpan.org/pod/Encode#UTF-8-vs.-utf8-vs.-UTF8),
the latter indicating the generalized
variant that Perl uses internally. This distinction is too subtle—and
too Perl-specific—to help anyone who doesn’t already intimately know
Perl and its character encoding “gotchas”.
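
To illustrate how subtle that distinction is, here is a small sketch of how the two spellings behave in `Encode`. It assumes a reasonably recent `Encode` and uses U+110000 (one past the Unicode maximum) as an arbitrary test code point; the variable names are illustrative.

```perl
use strict;
use warnings;
no warnings 'non_unicode';    # the example deliberately uses a non-Unicode code point
use Encode ();

# Canonical names that Encode resolves for the two spellings.
my $strict_name = Encode::find_encoding('UTF-8')->name;
my $lax_name    = Encode::find_encoding('utf8')->name;

# One past the highest code point that Unicode defines.
my $beyond = chr(0x110000);

# The lax encoding accepts it and emits the generalized four-octet form.
my $lax_octets = Encode::encode('utf8', $beyond);

# The strict encoding refuses it when asked to croak on errors. A copy is
# passed because Encode may modify its input when a CHECK value is given.
my $strict_ok = eval {
    my $copy = $beyond;
    Encode::encode('UTF-8', $copy, Encode::FB_CROAK());
    1;
};

print "UTF-8 resolves to: $strict_name\n";
print "utf8  resolves to: $lax_name\n";
printf "lax octets: %s\n", join ' ', map { sprintf '%02X', ord } split //, $lax_octets;
print "strict encode ", ($strict_ok ? "succeeded" : "croaked"), "\n";
```

The only visible difference between the two calls is a hyphen and letter case in the encoding name.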

## Rationale

Renaming this flag will achieve several benefits:

1. The mistaken belief that Perl uses UTF-8 internally will recede.

2. The term “heavy” is more abstract than the well-known term “UTF-8”.
That abstract quality will discourage Perl programmers from building
application logic atop this part of Perl’s implementation.

3. The term “UTF-8 string” will be less sensible to use in reference
to Perl’s internals since Perl will provide an official replacement
(“heavy string”). This will help to prevent confusion
when discussing encoding and related matters; having the terms
“heavy UTF-8 string”, “heavy Unicode string”, “non-heavy UTF-8 string”,
and “non-heavy Unicode string” will clarify matters where
the current ambiguity of “UTF-8 string” impedes communication.

## Specification

The following renames are proposed; in each case the old name
should remain as an alias for the new (with appropriate indications
in documentation):

- `SVf_UTF8`        -> `SVf_HEAVY`
- `SvUTF8`          -> `SvHEAVY`
- `SvUTF8_on`       -> `SvHEAVY_on`
- `SvUTF8_off`      -> `SvHEAVY_off`
- `SvPOK_only_UTF8` -> `SvPOK_only_HEAVY`
- `HeUTF8`          -> `HeHEAVY`
- `HvNAMEUTF8`      -> `HvNAMEHEAVY`
- `HvENAMEUTF8`     -> `HvENAMEHEAVY`
- `PadnameUTF8`     -> `PadnameHEAVY`
- `sv_utf8_upgrade`             -> `sv_to_heavy`
- `sv_utf8_upgrade_flags`       -> `sv_to_heavy_flags`
- `sv_utf8_upgrade_flags_grow`  -> `sv_to_heavy_flags_grow`
- `sv_utf8_upgrade_nomg`        -> `sv_to_heavy_nomg`
- `sv_utf8_downgrade` -> `sv_from_heavy`

The following changes are proposed:

- `sv_dump` should output `HEAVY` rather than `UTF8`.
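
For context, the current `sv_dump` output can be inspected from Perl via the core `Devel::Peek` module. The sketch below captures that output by temporarily redirecting `STDERR` to a temporary file; `dump_to_string` is a hypothetical helper written for this example, not part of any module.

```perl
use strict;
use warnings;
use Devel::Peek ();
use File::Temp ();

# Hypothetical helper: run Devel::Peek::Dump on a scalar and return the
# text it writes to STDERR.
sub dump_to_string {
    my $tmp = File::Temp->new;

    open my $saved, '>&', \*STDERR or die "cannot dup STDERR: $!";
    open STDERR, '>', $tmp->filename or die "cannot redirect STDERR: $!";
    Devel::Peek::Dump($_[0]);
    open STDERR, '>&', $saved or die "cannot restore STDERR: $!";

    open my $in, '<', $tmp->filename or die "cannot read dump: $!";
    local $/;
    return scalar <$in>;
}

my $str = "\x{e9}";
utf8::upgrade($str);

my $dump = dump_to_string($str);

# Show only the FLAGS line, which currently lists UTF8 for this scalar.
my ($flags_line) = $dump =~ /^(\s*FLAGS = .*)$/m;
print "$flags_line\n";
```

Under this proposal, the same line would list `HEAVY` instead.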

Controls for input/output are **NOT** proposed for rename.
This includes the likes of `COPHH_KEY_UTF8`, `SvPVutf8`, etc.

Areas of uncertainty:

- `DO_UTF8`

Additionally, Perl’s `utf8` namespace is problematic: the following
functions operate on the flag, so they should (logically) be renamed and
moved to a separate namespace:

- `utf8::upgrade()`
- `utf8::downgrade()`
- `utf8::is_utf8()`

The `Internals` namespace is a good place for these because they concern
functionality that ordinary Perl programs should not need. These can thus
become `Internals::sv_to_heavy()`, etc.
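
As a rough illustration of what such aliases could look like from Perl, the sketch below binds new names to the existing functions by glob assignment. Only `Internals::sv_to_heavy` is named in this RFC; `Internals::sv_from_heavy` and `Internals::sv_is_heavy` are assumed names chosen for the example, and a real implementation would presumably live in the core rather than in user code.

```perl
use strict;
use warnings;

# Hypothetical aliases: each new name shares the existing implementation.
*Internals::sv_to_heavy   = \&utf8::upgrade;
*Internals::sv_from_heavy = \&utf8::downgrade;    # assumed name
*Internals::sv_is_heavy   = \&utf8::is_utf8;      # assumed name

my $str = "\x{e9}";

Internals::sv_to_heavy($str);
my $after_upgrade = Internals::sv_is_heavy($str) ? 1 : 0;

Internals::sv_from_heavy($str);
my $after_downgrade = Internals::sv_is_heavy($str) ? 1 : 0;

print "heavy after sv_to_heavy:   $after_upgrade\n";
print "heavy after sv_from_heavy: $after_downgrade\n";
```

Because the old and new names refer to the same code, existing callers of `utf8::upgrade()` and friends would keep working unchanged.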

## Nomenclature Considerations

### “wide”

Ricardo Signes’s
[A Million Billion Squiggly Characters](https://www.youtube.com/watch?v=TmTeXcEixEg)
talk at YAPC/NA 2016 used the term “wide” (before lamenting that the flag
is, in fact, called “UTF8”).

The term is not ideal here because “wide character” already refers to
a code point above 255. Someone could thus construe “wide string” to mean
merely a string that contains wide characters, which would exclude an
upgraded/heavy string whose non-ASCII code points all fall in the range
128-255 (i.e., UTF-8 variants).
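
A short sketch of that corner case: a string that carries the flag yet contains no wide characters. The sample word is arbitrary.

```perl
use strict;
use warnings;

# The only non-ASCII character here is U+00EF, which is below 256.
my $str = "na\x{ef}ve";
utf8::upgrade($str);

# Count the characters that Perl's documentation would call "wide" (> 255).
my $wide_count = grep { ord($_) > 255 } split //, $str;

printf "wide characters: %d, flag: %s\n",
    $wide_count, utf8::is_utf8($str) ? 'on' : 'off';
```

Calling this a “wide string” would be misleading, since nothing in it is wide.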

### “upgraded”

The terms “upgraded” and “downgraded” are another possibility. These
are a bit more “cumbersome”—both euphonically and visually—but are
at least currently-used terms in Perl’s documentation and C API.
Right now they’re tied to “utf8”, though (e.g., `sv_utf8_upgrade()`,
`utf8::upgrade()`), so we may want to consider names like `sv_pv_upgrade()`
and `string::upgrade()` to mitigate that.

### “heavy”

“Heavy” is _almost_ as easy to type as “wide” and “UTF8”—and _as_ easy as
the latter if you include the dash. It currently has no specific meaning
in Perl’s documentation or API. It usefully indicates the “weight” of
processing code points stored in UTF-8: more memory usage and slower
parsing than Latin-1.
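
The memory side of that “weight” can be seen with a small sketch that compares internal byte counts using `bytes::length()` from the core `bytes` module (which its own documentation recommends only for debugging). The string contents and sizes are arbitrary.

```perl
use strict;
use warnings;
use bytes ();    # loaded only for bytes::length(); the pragma is not enabled

# 1000 characters, each U+00E9, stored one byte per character.
my $latin1 = "\x{e9}" x 1000;

# The same 1000 characters, stored in the upgraded form.
my $heavy = $latin1;
utf8::upgrade($heavy);

printf "non-heavy: %d characters, %d bytes\n",
    length($latin1), bytes::length($latin1);
printf "heavy:     %d characters, %d bytes\n",
    length($heavy), bytes::length($heavy);
```

Both strings compare equal and have the same character length; only the upgraded one needs two bytes per character for this data.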
