
Re: Pre-RFC: Rename SVf_UTF8 et al.

From: demerphq
Date: September 3, 2021 06:30
Subject: Re: Pre-RFC: Rename SVf_UTF8 et al.
Message ID: CANgJU+U4e39m3-dpKX-oQU5CBkoCxiad9C+9qEkV7DFBDwCTvg@mail.gmail.com
On Thu, 2 Sept 2021 at 16:39, Dan Book <grinnz@gmail.com> wrote:

> There is way too much written here so I will be responding as I can.
>
> On Thu, Sep 2, 2021 at 9:21 AM demerphq <demerphq@gmail.com> wrote:
>
>> On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com>
>> wrote:
>>
>>>
>>> > On Aug 20, 2021, at 1:05 PM, demerphq <demerphq@gmail.com> wrote:
>>> >
>>> > On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com>
>>> wrote:
>>> > Per recent IRC discussion …
>>> >
>>> > PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
>>> confusion regarding the flag’s significance. Some think it indicates
>>> whether a given PV stores text versus binary. Some think it means that the
>>> PV is valid UTF-8. Still others likely hold other inaccurate views.
>>> >
>>> > The problem here is the naming. For example, consider `perl -e'my $foo
>>> = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that
>>> its code points (assuming use of a UTF-8 terminal) correspond to the bytes
>>> that encode “é” in UTF-8.
>>> >
>>> > Nope. It might contain utf8, but it is not UTF8-ON. Think of it like a
>>> square/rectangle relationship. All strings are "rectangles", all "squares"
>>> are rectangles, and some strings are squares; but unless the SQUARE flag is
>>> ON, perl should assume it is a rectangle, not a square. The SQUARE flag
>>> should only be set when the rectangle has been proved conclusively to be a
>>> square. That the SQUARE flag is off does not mean the rectangle is not a
>>> square, merely that it has not been proved to be one.
>>>
>>> You’re defining “a UTF-8 string” as “a string whose PV is marked as
>>> UTF-8”. I’m defining it as “a string whose Perl-visible code points happen
>>> to be valid UTF-8”.
>>>
>>
>> I don't find your definition to be very useful, nor descriptive of how
>> perl manages these matters, so I am not using it. You are confusing
>> different levels of abstraction. Your definition would also include cases
>> where the data is already encoded and flagged as utf8, so it doesn't make
>> sense to me.
>>
>> Here is the set of definitions that I am operating from:
>>
>> A "string" is a programming concept inside of Perl which is used to
>> represent "text" buffers of memory. There are three level of abstraction
>> for strings, two of which are tightly coupled. The three are the codepoint
>> level, semantic level and encoding level.
>>
>> At the codepoint level you can think of strings as variable-length
>> arrays of numbers (codepoints), where the numbers are restricted to the
>> range 0 to 0x10FFFF.
>>
>> At the semantic level you can think of these numbers (codepoints) as
>> representing characters from some form of text, with specific rules for
>> certain operations like case-folding, as well as a well-defined mapping to
>> graphemes which are displayed to our eyes when those numbers are rendered
>> by a display device like a terminal.
>>
>> The encoding level of abstraction addresses how those numbers
>> (codepoints) will be represented as bytes (octets) in memory inside of
>> Perl, and when you directly write the data to disk or to some other output
>> stream.
>>
>> There are two sets of codepoint range, semantics, and encoding available,
>> controlled by a flag associated with the string called the UTF8 flag.
>> When set, this flag indicates that the string can represent codepoints 0
>> to 0x10FFFF, should have Unicode semantics applied to it, and that its
>> in-memory representation is variable-width utf8. When the flag is not
>> set, it indicates the string can represent codepoints 0 to 255, has ASCII
>> case-folding semantics, and that its in-memory representation is
>> fixed-width octets.
>>
>> In order to be able to combine these two types of strings we need to
>> define some operations:
>>
>> upgrading/downgrading: converting a string from one set of semantics and
>> encoding to the other while preserving exactly the codepoint-level
>> representation. By tradition we call it upgrading when we go from Latin-1
>> to Unicode with the result being UTF8-on, and we call it downgrading when
>> we go from Unicode to Latin-1 with the result being UTF8-off. These
>> operations are NOT symmetrical. It is *not* possible to downgrade every
>> Unicode string to Latin-1; however, it is possible to upgrade every Latin-1
>> string to Unicode. By tradition upgrade and downgrade functions are no-ops
>> when their input is already in the form expected as the result, but this
>> is by tradition only.
>>
>> decoding/encoding: converting a string from one form to the other in a
>> way that transforms the codepoints from one form to a potentially different
>> form. Traditionally we speak of decode_utf8() taking a latin1 string
>> containing octets that make up a utf8-encoded string, and returning a
>> string which is UTF8-on and which represents the Unicode version of those
>> octets. For well-formed input this results in no change to the underlying
>> string, but the flag is flipped on. Vice versa, we speak of encode_utf8(),
>> which converts its input to a utf8-encoded form, regardless of the form in
>> which it was represented internally.
>>
>
> This is incorrect. Decode converts a string of bytes at the logical level
> (upgraded or downgraded does not matter) and returns a string of characters
> at the logical level (upgraded or downgraded does not matter). It may
> commonly use upgraded or downgraded strings as the input or output for
> efficiency, but this is not required.
>

Nope, *you* are wrong. Decoding does not use upgrading or downgrading.
Decoding utf8 is logically equivalent to an upgrade operation when the
string contains only codepoints 0-127. For any codepoint ABOVE that it does
something very different.
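
A quick way to see the difference, using only the core utf8 functions (a
sketch, but it should run as-is on any modern perl):

  my $up  = "\x{c3}\x{a9}";    # two octets: C3 A9
  my $dec = "\x{c3}\x{a9}";
  utf8::upgrade($up);          # same codepoints, new internal encoding
  utf8::decode($dec);          # reinterpret the octets as utf8
  printf "upgrade: %d codepoint(s), first is %x\n", length($up), ord($up);
  printf "decode:  %d codepoint(s), first is %x\n", length($dec), ord($dec);

That should print 2/c3 for the upgrade and 1/e9 for the decode.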



>
>>
>>
>>> What you call “a UTF-8 string” is what I propose we call, per existing
>>> nomenclature (i.e., utf8::upgrade()) “a PV-upgraded string”, with
>>> corresponding code changes. Then the term “UTF-8 string” makes sense from a
>>> pure-Perl context without requiring Perl programmers to worry about
>>> interpreter internals.
>>>
>>>
>> No. The flag does not mean "upgraded"; it means "unicode semantics, utf8
>> encoding". Upgrading is one way to get such a string, and it might even be
>> the most common, but the most important and most likely to be correct way
>> is explicit decoding.
>>
>> If we are to rename the flag then we should just rename it the UNICODE
>> flag. That would have saved a world of confusion.
>>
>
> This is exactly what we have defined as "upgraded". Decoding does not
> define the internal format of the resulting string at all. The only
> internal format which counts as "upgraded" is the one where the UTF8 flag
> is on.
>

Your definition is wrong, then. You seem to have "upgrading" and "decoding"
muddled.

Decoding most definitely DOES define the internal format of the resulting
string. If you decode utf8 the result is a UTF8-on string. If that string
contained utf8 representing codepoints above 127 then the result will
differ from the input.

If you upgrade the string "\303\251" you will end up with a utf8-on string
which contains two codepoints, "\303" and "\251". You will NOT end up with
the correct codepoint E9.
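
You can check this with Devel::Peek, same as my earlier example (a sketch;
the addresses will obviously differ):

  use Devel::Peek;
  my $str = "\303\251";
  utf8::upgrade($str);
  Dump($str);

On my reading of the internals, the interesting lines should look like:

  FLAGS = (POK,pPOK,UTF8)
  PV = ... "\303\203\302\251"\0 [UTF8 "\x{c3}\x{a9}"]

i.e. still two codepoints, C3 and A9, each now stored as a two-octet utf8
sequence, not the single codepoint E9 that decode produces.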


>
>>
>>> Whether Perl stores that code point as one byte or as two is Perl’s
>>> business alone … right?
>>>
>>
>> Well, it would be weird if we stored Unicode data in a form not supported
>> by Unicode, don't you think? There is no single-octet representation of
>> the codepoint E9 defined by Unicode as far as I know.
>>
>>
>>>
>>> > I do not understand your point that only the initiated can understand
>>> this flag. It means one and only one thing: that the perl internals should
>>> assume that the buffer contains utf8-encoded data, that perl should
>>> apply unicode semantics when doing character and case-sensitive operations,
>>> and that perl can make certain assumptions when it processes the data (eg
>>> that it is not malformed).
>>>
>>> The behaviour you’re talking about is what the unicode_strings and
>>> unicode_eval features specifically do away with (i.e., fix), right?
>>
>>
>> I'm not familiar enough with those to comment. I assume they relate to
>> what assumptions Perl should make about strings which are constructed as
>> literals in the source code, where there is a great deal of ambiguity about
>> what is going on, compared to actual code that constructs such strings,
>> where things are exact.
>>
>
> They do not. They relate to consistently applying unicode rules to the
> logical contents of the strings (in practice, making sure to work with
> upgraded strings internally). The only mechanism that affects the
> interpretation of literal strings is "use utf8".
>

I'll read up on this.
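
For anyone following along, this is the sort of difference I believe the
feature makes (a sketch; assumes perl 5.12 or later, where the feature
exists):

  my $s = "\x{df}";            # downgraded, UTF8 flag off
  {
      no feature 'unicode_strings';
      print uc($s), "\n";      # "\xdf": ASCII-only rules for a byte string
  }
  {
      use feature 'unicode_strings';
      print uc($s), "\n";      # "SS": Unicode rules regardless of the flag
  }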


>
>
>>
>>
>>>
>>> You’re omitting what IMO is the most obvious purpose of the flag: to
>>> indicate whether the code points that the PV stores are the plain bytes, or
>>> are the UTF-8-decoded code points. This is why you can print() the string
>>> in either upgraded or downgraded forms, and it comes out the same.
>>>
>>
>> It's hard to say what you are referring to here. If you mean codepoints
>> 0-127, then it is unsurprising, as their representation is equivalent in
>> ASCII, UTF8, and Latin-1. But if you mean a codepoint above the ASCII
>> range, then no, they should not come out the same. If you are piping that
>> data to a file I would expect the octets written to that file to be
>> different (assuming a binary filehandle with no layers magically
>> transforming things). If your terminal renders them the same then I
>> assume it is doing some magic behind the scenes to deal with malformed
>> utf8.
>>
>
> Not correct. An upgraded or downgraded string prints identically because
> you are printing the logical ordinals, which do not change under this
> operation. Whether those ordinals are interpreted as bytes or Unicode
> characters depends on what you are printing to, but in either case the
> internally-stored bytes are irrelevant to the user except to determine
> what those logical ordinals are.
>

Dude, you keep saying I am not correct when what I have said is easily
verifiable.

If you print chr(0xe9) to a filehandle and the output does not contain the
octet E9 then there is a problem.

If you print chr(0xe9) to a utf8 terminal it should render a Unicode
replacement character for a broken utf8 sequence.

If you print an encoded chr(0xe9) then it should render the glyph for E9.

If you think anything else is happening then prove it with code.
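
For the record, here is the code (a sketch; the file names are arbitrary):

  # raw handle: the single octet E9 is written
  open my $raw, '>:raw', 'raw.bin' or die $!;
  print {$raw} chr(0xE9);
  close $raw;

  # encoding handle: the two octets C3 A9 are written
  open my $enc, '>:encoding(UTF-8)', 'enc.bin' or die $!;
  print {$enc} chr(0xE9);
  close $enc;

cat raw.bin on a utf8 terminal and you should see the replacement
character; cat enc.bin and you should see é.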


>
>>
>>>
>>> > I also know what happens here:
>>> >
>>> > my $foo="\x{c3}\x{a9}";
>>> > utf8::decode($foo);
>>> > Dump($foo);
>>> >
>>> > SV = PV(0x2303fc0) at 0x2324c98
>>> >   REFCNT = 1
>>> >   FLAGS = (POK,IsCOW,pPOK,UTF8)
>>> >   PV = 0x2328300 "\303\251"\0 [UTF8 "\x{e9}"]
>>> >   CUR = 2
>>> >   LEN = 10
>>> >   COW_REFCNT = 1
>>> >
>>> > That is, I start off with two octets, C3, A9, which happen to be the
>>> encoding for the codepoint E9, which happens to be é.
>>> > I then tell perl to "decode" those octets, which really means I tell
>>> perl to check that the octets actually do make up valid utf8. And if perl
>>> agrees that indeed these are valid utf8 octets, then it turns the flag on.
>>> Now it doesn't matter if you *meant* to construct utf8 in the variable fed
>>> to decode; all that matters is that at an octet level those octets happen
>>> to make up valid utf8.
>>>
>>> I think you’re actually breaking the abstraction here by assuming that
>>> Perl implements the decode by setting a flag.
>>>
>>>
>> No I am not. The flag is there to tell the perl internals how to
>> manipulate the string. decode's task is to take arbitrary strings of
>> octets, ensure that they can be decoded as valid utf8, possibly do some
>> conversion (eg for forbidden utf8 sequences or other normalization) as it
>> does so, and then SET THE FLAG. Only once decode is done is the string
>> "Unicode", and only then is it "utf8". Prior to that it was just random
>> octets. Decode doesn't need to do anything BUT set the flag, because the
>> internal encoding matches the external encoding in this case. If it were
>> decoding UTF16LE then it would have to do conversion as well.
>>
>
> Not correct. The flag is there only to tell Perl internals whether the
> internal bytes represent the ordinals directly or via UTF-8-like encoding.
> The result of decoding can be downgraded, and an upgraded string can be
> decoded,
>

Show me the code. As far as I know decode operations do not operate on
unicode strings. Calling decode_utf8 on a string which is utf8 is a no-op.
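
For instance, with the Encode that ships with perl (a sketch; on the
versions I have tried, decoding a string that contains codepoints above
255 croaks rather than doing anything useful):

  use Encode ();
  my $wide = "\x{100}";        # necessarily a UTF8-on string
  my $ok = eval { Encode::decode('UTF-8', $wide); 1 };
  print $@ unless $ok;         # "Cannot decode string with wide characters ..."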


> these are perfectly cromulent operations if the logical contents are as
> expected. A unicode string can exist without ever having been decoded, all
> that is required is to call a function that interprets the ordinals as a
> unicode string.
>

"A unicode string can exist without ever having been decoded, all that is
required is to call a function that interprets the ordinals as a unicode
string."

And the function that does that interpretation is called decode. You just
contradicted yourself.


>>
>>> It would be just as legitimate to mutate the PV to store a single octet,
>>> 0xe9, and leave the UTF8 flag off.
>>
>>
>> Nope. That would mean that Perl would use ASCII/Latin-1 case-folding
>> rules on the result, which would be wrong. It should use Unicode
>> case-folding rules for codepoint E9 if it was decoded as that codepoint.
>> (Change the example to \x{DF} and you can see these issues in the flesh:
>> \x{DF} should match "ss" in Unicode, but in ASCII/Latin-1 it only matches
>> \x{DF}. The fc() version of \x{DF} is "ss", but in Latin-1/ASCII there
>> are no multi-char case folds.) Even more suggestive that Perl doing this
>> would be wrong is that there is in fact NO valid Unicode encoding of
>> codepoint E9 which is only 1 octet long. So it would be extremely wrong
>> of Perl to use a non-Unicode encoding of Unicode data, don't you think?
>> Also, what would perl do when the codepoint doesn't fit into a single
>> octet? Your argument might have some merit if you were arguing that Perl
>> could have decoded it into "\x{E9}\0\0\0" and set a UTF-32 flag, but as
>> stated it doesn't make sense.
>>
>
> Not correct. Under old rules, yes, the UTF8 flag determined whether
> Unicode rules were used in various operations; this was an abstraction
> break, so the unicode_strings feature was added to fix the problem, and it
> has been enabled in feature bundles since 5.12.
>

Ah, ok, so if you *change* the default mode of perl it does something
different than what I described, and that makes my comments "incorrect"?
What I described is how "normal" perl, without any new features enabled,
works. If there are features that change what I have said, feel free to use
them. But it doesn't change the fact that what I said is an accurate
description of how the perl internals normally function.
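
To make that concrete, here is the \x{DF} example from above run with no
features enabled at all (a sketch):

  my $sharp_s = "\x{df}";
  print "ss" =~ /\A$sharp_s\z/i ? "match\n" : "no match\n";
      # no match: downgraded string, ASCII rules
  utf8::upgrade($sharp_s);
  print "ss" =~ /\A$sharp_s\z/i ? "match\n" : "no match\n";
      # match: UTF8 flag on, Unicode casefolding applies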

cheers,
Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
