Leon Timmermans
September 3, 2021 13:10
Re: Pre-RFC: Rename SVf_UTF8 et al.
Message ID:
On Fri, Sep 3, 2021 at 8:30 AM demerphq <> wrote:

> On Thu, 2 Sept 2021 at 16:39, Dan Book <> wrote:
>> There is way too much written here so I will be responding as I can.
>> On Thu, Sep 2, 2021 at 9:21 AM demerphq <> wrote:
>>> On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <>
>>> wrote:
>>>> > On Aug 20, 2021, at 1:05 PM, demerphq <> wrote:
>>>> >
>>>> > On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <>
>>>> wrote:
>>>> > Per recent IRC discussion …
>>>> >
>>>> > PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
>>>> confusion regarding the flag’s significance. Some think it indicates
>>>> whether a given PV stores text versus binary. Some think it means that the
>>>> PV is valid UTF-8. Still others likely hold other inaccurate views.
>>>> >
>>>> > The problem here is the naming. For example, consider `perl -e'my
>>>> $foo = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact
>>>> that its code points (assuming use of a UTF-8 terminal) correspond to the
>>>> bytes that encode “é” in UTF-8.
>>>> >
>>>> > Nope. It might contain utf8, but it is not UTF8-ON. Think of it like
>>>> a square/rectangle relationship. All strings are "rectangles", all
>>>> "squares" are rectangles, some strings are squares, but unless SQUARE flag
>>>> is ON perl should assume it is a rectangle, not a square. The SQUARE flag
>>>> should only be set when the rectangle has been proved conclusively to be a
>>>> square. That the SQUARE flag is off does not mean the rectangle is not a
>>>> square, merely that the square has not been proved to be such.
>>>> You’re defining “a UTF-8 string” as “a string whose PV is marked as
>>>> UTF-8”. I’m defining it as “a string whose Perl-visible code points happen
>>>> to be valid UTF-8”.
>>> I don't find your definition to be very useful, nor descriptive of how
>>> perl manages these matters, so I am not using it. You are confusing
>>> different levels of abstraction. Your definition also would include cases
>>> where the data is already encoded and flagged as utf8. So it doesn't make
>>> sense to me.
>>> Here is the set of definitions that I am operating from:
>>> A "string" is a programming concept inside of Perl which is used to
>>> represent "text" buffers of memory. There are three levels of abstraction
>>> for strings, two of which are tightly coupled. The three are the codepoint
>>> level, semantic level and encoding level.
>>> At the codepoint level you can think of strings as variable length
>>> arrays of numbers (codepoints), where the numbers are restricted to 0 to
>>> 0x10FFFF.
>>> At the semantics level you can think of these numbers (codepoints) as
>>> representing characters from some form of text with specific rules for
>>> certain operations like case-folding, as well as a well defined mapping to
>>> graphemes which are displayed to our eyes when those numbers are rendered
>>> by a display device like a terminal.
>>> The encoding level of abstraction addresses how those numbers
>>> (codepoints) will be represented as bytes (octets) in memory inside of
>>> Perl, and when you directly write the data to disk or to some other output
>>> stream.
>>> There are two sets of codepoint range, semantics and encoding available,
>>> which are controlled by a flag associated with the string called the UTF8
>>> flag. When set this flag indicates that the string can represent codepoints
>>> 0 to 0x10FFFF, should have Unicode semantics applied to it, and that its in
>>> memory representation is variable-width utf8. When the flag is not set it
>>> indicates the string can represent codepoints 0 to 255, has ASCII
>>> case-folding semantics, and that its in memory representation is fixed
>>> width octets.
>>> In order to be able to combine these two types of strings we need to
>>> define some operations:
>>> upgrading/downgrading: converting a string from one set of semantics and
>>> encoding to the other while preserving exactly the codepoint level
>>> representation. By tradition we call it upgrading when we go from Latin-1
>>> to Unicode with the result being UTF8-on, and we call it downgrading when
>>> we go from Unicode to Latin1 with the result being UTF8-off. These
>>> operations are NOT symmetrical. It is *not* possible to downgrade every
>>> Unicode string to Latin-1, however it is possible to upgrade every Latin-1
>>> string to Unicode.  By tradition upgrade and downgrade functions are noops
>>> when their input is already in the form expected as the result, but this is
>>> by tradition only.
>>> decoding/encoding: converting a string from one form to the other in a
>>> way that transforms the codepoints from one form to a potentially different
>>> form. Traditionally we speak of decode_utf8() taking a latin1 string
>>> containing octets that make up a utf8 encoded string, and returning a
>>> string which is UTF8 on which represents the Unicode version of those
>>> octets. For well formed input this results in no change to the underlying
>>> string, but the flag is flipped on. Vice versa we speak of encode_utf8()
>>> which converts its input to a utf8 encoded form, regardless of what form it
>>> was represented internally.
>> This is incorrect. Decode converts a string of bytes at the logical level
>> (upgraded or downgraded does not matter) and returns a string of characters
>> at the logical level (upgraded or downgraded does not matter). It may
>> commonly use upgraded or downgraded strings as the input or output for
>> efficiency but this is not required.
> Nope *you* are wrong.  Decoding does not use upgrading or downgrading.
> Decoding utf8 is logically equivalent to an upgrade operation when the
> string contains only codepoints 0-127. For any codepoint ABOVE that it does
> something very different.
>>>> What you call “a UTF-8 string” is what I propose we call, per existing
>>>> nomenclature (i.e., utf8::upgrade()) “a PV-upgraded string”, with
>>>> corresponding code changes. Then the term “UTF-8 string” makes sense from a
>>>> pure-Perl context without requiring Perl programmers to worry about
>>>> interpreter internals.
>>> No. The flag does not mean "upgraded" it means "unicode semantics, utf8
>>> encoding". Upgrading is one way to get such a string, and it might even be
>>> the most common, but the most important and likely to be correct way is
>>> explicit decoding.
>>> If we are to rename the flag then we should just rename it as the
>>> UNICODE flag. Would have saved a world of confusion.
>> This is exactly what we have defined as "upgraded". Decoding does not
>> define the internal format of the resulting string at all. The only
>> internal format which is upgraded is when the UTF8 flag is on.
> Your definition is wrong then. You seem to have "upgrading" and "decoding"
> muddled.
> Decoding most definitely DOES define the internal format of the result
> string. If you decode utf8 the result is a UTF8 on string. If that string
> contained utf8 representing codepoints above 127 then the result will be
> different.
> If you upgrade the string: "\303\251" you will end up with a utf8 on
> string which contains two codepoints, "\303" and "\251". You will NOT end
> up with the correct codepoint E9

It rather sounds to me like your disagreement is mostly about definitions.
This happens a lot when discussing Perl's Unicode support.

>>>> You’re omitting what IMO is the most obvious purpose of the flag: to
>>>> indicate whether the code points that the PV stores are the plain bytes, or
>>>> are the UTF-8-decoded code points. This is why you can print() the string
>>>> in either upgraded or downgraded forms, and it comes out the same.
>>> It's hard to say what you are referring to here. If you mean codepoints
>>> 0-127, then it is unsurprising as the representation of them is equivalent
>>> in ASCII and UTF8 and Latin1. But if you mean a codepoint above the ASCII
>>> plane, then no they should not come out the same. If you are piping that
>>> data to a file I would expect the octets written to that file to be
>>> different. (assuming a binary filehandle with no layers magically
>>> transforming things). If your terminal renders them the same then I assume
>>> it is doing some magic behind the scenes to deal with malformed utf8.
>> Not correct. An upgraded or downgraded string prints identically because
>> you are printing the logical ordinals which do not change by this
>> operation. Whether those ordinals are interpreted as bytes or Unicode
>> characters depends what you are printing to, but in either case the
>> internally-stored bytes are irrelevant to the user except to determine what
>> those logical ordinals are
> Dude, you keep saying I am not correct when what I have said is easily
> verifiable.
> If you print chr(0xe9) to a filehandle and it does not contain the octet
> E9 then there is a problem.
> If you print chr(0xe9) to a utf8 terminal it should render a Unicode
> replacement character for a broken utf8 sequence.
> If you print an encoded chr(0xe9) then it should render the glyph for E9.
> If you think anything else is happening then prove it with code.

That is all true in the absence of an :encoding(...) or :utf8 layer.

An upgraded E9 will also still print E9 (and thus be broken utf-8).
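
A minimal sketch to check this. It assumes an in-memory filehandle is a fair
stand-in for a real file, and octets_written is a helper name made up for the
illustration, not anything from core:

```perl
use strict;
use warnings;

# Hypothetical helper: print $str to an in-memory filehandle opened
# with $mode and return the octets that were written, as hex.
sub octets_written {
    my ($mode, $str) = @_;
    my $buf = '';
    open my $fh, $mode, \$buf or die "open failed: $!";
    print {$fh} $str;
    close $fh or die "close failed: $!";
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $buf;
}

my $down = chr(0xE9);    # stored internally as the single octet E9
my $up   = chr(0xE9);
utf8::upgrade($up);      # same code point, stored as C3 A9, UTF8 flag on

print octets_written('>:raw',  $down), "\n";    # no encoding layer, downgraded
print octets_written('>:raw',  $up),   "\n";    # no encoding layer, upgraded
print octets_written('>:utf8', $down), "\n";    # :utf8 layer, downgraded
print octets_written('>:utf8', $up),   "\n";    # :utf8 layer, upgraded
```

Without a layer both strings should come out as the lone octet E9, and with
the :utf8 layer both should come out as C3 A9. The internal representation
should make no difference to what is written in either case.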

>>>> > I also know what happens here:
>>>> >
>>>> > my $foo="\x{c3}\x{a9}";
>>>> > utf8::decode($foo);
>>>> > Dump($foo);
>>>> >
>>>> > SV = PV(0x2303fc0) at 0x2324c98
>>>> >   REFCNT = 1
>>>> >   FLAGS = (POK,IsCOW,pPOK,UTF8)
>>>> >   PV = 0x2328300 "\303\251"\0 [UTF8 "\x{e9}"]
>>>> >   CUR = 2
>>>> >   LEN = 10
>>>> >   COW_REFCNT = 1
>>>> >
>>>> > That is, i start off with two octets, C3 - A9, which happens to be
>>>> the encoding for the codepoint E9, which happens to be é.
>>>> > I then tell perl to "decode" those octets, which really means I tell
>>>> perl to check that the octets actually do make up valid utf8. And if perl
>>>> agrees that indeed these are valid utf8 octets, then it turns the flag on.
>>>> Now it doesn't matter if you *meant* to construct utf8 in the variable fed
>>>> to decode, all that matters is that at an octet level those octet happen to
>>>> make up valid utf8.
>>>> I think you’re actually breaking the abstraction here by assuming that
>>>> Perl implements the decode by setting a flag.
>>> No I am not. The flag is there to tell the perl internals how
>>> to manipulate the string. decode's task is to take arbitrary strings of
>>> octets and ensure that they can be decoded as valid utf8 and possibly to do
>>> some conversion (eg for forbidden utf8 sequences or other normalization) as
>>> it does so and then SETS THE FLAG. Only once decode is done is the string
>>> "Unicode" and is the string "utf8". Prior to that it was just random
>>> octets. It doesn't need to do anything BUT set the flag because its internal
>>> encoding matches the external encoding in this case. If it was decoding
>>> UTF16LE then it would have to do conversion as well.
>> Not correct. The flag is there only to tell Perl internals whether the
>> internal bytes represent the ordinals directly or via UTF-8-like encoding.
>> The result of decoding can be downgraded, and an upgraded string can be
>> decoded,
> Show me the code. As far as I know decode operations do not operate on
> unicode strings. Calling decode_utf8 on a string which is utf8 is a noop.

Almost. It will first try to downgrade the string, and if that fails it will
return false (and thus be a noop). It will decode a latin1-safe unicode string.

So «my $s = "\303\251"; utf8::upgrade($s); utf8::decode($s)» will result in
$s being equal to "\x{e9}" (and it will be upgraded).
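
To make the upgrade/decode distinction concrete, here is a small sketch using
only the core utf8:: functions. The describe helper is made up for the
illustration, and the comments state what I would expect to see rather than
guaranteed output:

```perl
use strict;
use warnings;

# Hypothetical helper: show a string's code points and its UTF8 flag.
sub describe {
    my ($str) = @_;
    my $cps = join ' ', map { sprintf('%02X', ord($_)) } split //, $str;
    return sprintf '%s flag=%s', $cps, utf8::is_utf8($str) ? 'on' : 'off';
}

# Upgrade only: the two code points C3 and A9 are preserved.
my $upgraded = "\303\251";
utf8::upgrade($upgraded);
print 'upgrade only:      ', describe($upgraded), "\n";

# Decode only: the two octets become the single code point E9.
my $decoded = "\303\251";
utf8::decode($decoded);
print 'decode only:       ', describe($decoded), "\n";

# Upgrade, then decode: decode downgrades first, so it still yields E9.
my $both = "\303\251";
utf8::upgrade($both);
my $ok_both = utf8::decode($both);
print 'upgrade + decode:  ', describe($both),
      ' returned=', ($ok_both ? 'true' : 'false'), "\n";

# A string that cannot be downgraded: decode returns false, string untouched.
my $wide = "\x{100}";
my $ok_wide = utf8::decode($wide);
print 'wide char decode:  ', describe($wide),
      ' returned=', ($ok_wide ? 'true' : 'false'), "\n";
```

The first case keeps two code points while the second and third end up with
one, which is the whole difference between upgrading and decoding. The last
case is the "return false" path mentioned above.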


