develooper Front page | perl.perl5.porters | Postings from August 2021

Re: Pre-RFC: Rename SVf_UTF8 et al.

From: Felipe Gasper
Date: August 20, 2021 17:48
Subject: Re: Pre-RFC: Rename SVf_UTF8 et al.
Message ID: E3180753-2A82-4D52-9110-317815513DB8@felipegasper.com

> On Aug 20, 2021, at 1:05 PM, demerphq <demerphq@gmail.com> wrote:
> 
> On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com> wrote:
> Per recent IRC discussion …
> 
> PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of confusion regarding the flag’s significance. Some think it indicates whether a given PV stores text versus binary. Some think it means that the PV is valid UTF-8. Still others likely hold other inaccurate views.
> 
> The problem here is the naming. For example, consider `perl -e'my $foo = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its code points (assuming use of a UTF-8 terminal) correspond to the bytes that encode “é” in UTF-8.
> 
> Nope. It might contain utf8, but it is not UTF8-ON. Think of it like a square/rectangle relationship. All strings are "rectangles", all "squares" are rectangles, some strings are squares, but unless SQUARE flag is ON perl should assume it is a rectangle, not a square. The SQUARE flag should only be set when the rectangle has been proved conclusively to be a square. That the SQUARE flag is off does not mean the rectangle is not a square, merely that the square has not been proved to be such.

You’re defining “a UTF-8 string” as “a string whose PV is marked as UTF-8”. I’m defining it as “a string whose Perl-visible code points happen to be valid UTF-8”.

What you call “a UTF-8 string” is what I propose we call, per existing nomenclature (i.e., utf8::upgrade()), “a PV-upgraded string”, with corresponding code changes. Then the term “UTF-8 string” makes sense from a pure-Perl context without requiring Perl programmers to worry about interpreter internals.
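
To make the two definitions concrete, here is a minimal sketch (the variable names are only illustrative). It assumes a stock perl with no locale or -C switches in play, and shows that utf8::upgrade() flips the internal flag without changing anything a pure-Perl program can observe about the string:

```perl
use strict;
use warnings;

# Two code points, 0xC3 and 0xA9, which happen to match the UTF-8 bytes for "é".
my $plain    = "\xc3\xa9";
my $upgraded = $plain;
utf8::upgrade($upgraded);    # changes the internal storage only

printf "flag set: plain=%d upgraded=%d\n",
    utf8::is_utf8($plain) ? 1 : 0, utf8::is_utf8($upgraded) ? 1 : 0;
printf "equal=%s length=%d/%d\n",
    ($plain eq $upgraded ? "yes" : "no"), length($plain), length($upgraded);
```

Under my definition both variables hold the same “UTF-8 string”; only the second is “PV-upgraded”.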

> The “UTF-8 flag”, however, is likely *not* set on this string. By contrast, consider `perl -Mutf8 -e'my $foo = "é"'`. Here $foo has the “UTF-8 flag” set, but $foo is NOT a “UTF-8 string” because its code points (in this case, only 1) aren’t valid UTF-8.
> 
> Except it is valid UTF-8: (at least in my utf8 terminal).
> 
> $ perl -MDevel::Peek -Mutf8 -e'my $foo = "é"; Dump($foo)'
> SV = PV(0x153efc0) at 0x155fb38
>   REFCNT = 1
>   FLAGS = (POK,IsCOW,pPOK,UTF8)
>   PV = 0x1563240 "\303\251"\0 [UTF8 "\x{e9}"]
>   CUR = 2
>   LEN = 10
>   COW_REFCNT = 1
> 
> So the string is UTF-8. 

Again, different definitions. The Perl-visible string contains a single code point, 0xe9. This code point doesn’t correspond to valid UTF-8 bytes, so IMO it doesn’t make sense to call it a “UTF-8 string”. Whether Perl stores that code point as one byte or as two is Perl’s business alone … right?
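
A small sketch of what I mean, using utf8::decode()'s return value as a rough "are these code points valid UTF-8?" test. The helper name decodes_as_utf8 is hypothetical, and the strings are built with \x escapes so the result doesn't depend on anyone's terminal:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string's code points form well-formed UTF-8.
# It works on a copy, so the caller's string is left untouched.
sub decodes_as_utf8 {
    my ($copy) = @_;
    return utf8::decode($copy) ? 1 : 0;
}

my $downgraded = "\xe9";                       # one code point, stored as one byte
my $upgraded   = "\xe9";
utf8::upgrade($upgraded);                      # same code point, stored as two bytes
my $two_chars  = "\xc3\xa9";                   # two code points

printf "downgraded 0xE9: %d\n", decodes_as_utf8($downgraded);
printf "upgraded 0xE9:   %d\n", decodes_as_utf8($upgraded);
printf "0xC3 0xA9:       %d\n", decodes_as_utf8($two_chars);
```

The first two should give the same answer as each other, regardless of how Perl stores the code point.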

> I do not understand your point that only the initiated can understand this flag. It means one and only one thing: that the perl internals should assume that the buffer contains utf8 encoded data, that perl should apply unicode semantics when doing character and case-sensitive operations, and that perl can make certain assumptions when it is processing the data (e.g., that it is not malformed).

The behaviour you’re talking about is what the unicode_strings and unicode_eval features specifically do away with (i.e., fix), right?
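
For anyone following along, a hedged sketch of the behaviour in question, using uc() as the example operation. It assumes no `use locale` and no `use v5.12`-or-later line elsewhere in the file (either would change the semantics), and the variable names are illustrative:

```perl
use strict;
use warnings;

my $down = "\xe9";
my $up   = "\xe9";
utf8::upgrade($up);          # same string as far as eq is concerned

# Outside the feature, uc() consults the internal flag.
my ($uc_down, $uc_up) = (uc $down, uc $up);

# Inside the feature's scope, both forms get Unicode rules.
my ($fixed_down, $fixed_up);
{
    use feature 'unicode_strings';
    ($fixed_down, $fixed_up) = (uc $down, uc $up);
}

printf "without feature: %vX vs %vX\n", $uc_down, $uc_up;
printf "with feature:    %vX vs %vX\n", $fixed_down, $fixed_up;
```

Two strings that compare equal giving different uc() results is the abstraction leak; the feature makes the flag invisible again for this operation.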

You’re omitting what IMO is the most obvious purpose of the flag: to indicate whether the code points that the PV stores are the plain bytes, or are the UTF-8-decoded code points. This is why you can print() the string in either upgraded or downgraded forms, and it comes out the same.
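
A quick way to check the print() claim without involving a terminal is to print to an in-memory filehandle. This is only a sketch; printed_bytes is a hypothetical helper, and the explicit :raw layer is there so that default I/O layers don't muddy the result:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the bytes that print() emits for a string
# on a handle with no encoding layer.
sub printed_bytes {
    my ($str) = @_;
    open my $fh, '>:raw', \my $buf or die "open: $!";
    print {$fh} $str;
    close $fh or die "close: $!";
    return $buf;
}

my $down = "\xe9";
my $up   = "\xe9";
utf8::upgrade($up);

printf "downgraded prints: %vX\n", printed_bytes($down);
printf "upgraded prints:   %vX\n", printed_bytes($up);
```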

> BTW, your scheme needs to account for WAS_UTF8 as well. Most people don't know it, but there are actually three types of strings in the perl internals: UTF8-ON, UTF8-OFF, and UTF8-OFF + WAS_UTF8. It only manifests in hash keys, but it needs to be accounted for as well in any renaming. Perl dictates that keys which are character-wise equivalent hash the same regardless of the UTF8 flag (or, put alternatively, the hash should be of the codepoints the string represents, NOT the octets that make up that representation). This means UTF8-ON keys are always downgraded on lookup or store in a hash. If the downgrade is successful, the key is marked as WAS_UTF8 and the downgraded string is stored and hashed; if it was unsuccessful (e.g., it contains codepoints above 255), it is marked as UTF8-ON and the original buffer is hashed. When the key is extracted with keys() or each(), if the WAS_UTF8 flag is set, the string is upgraded back to the UTF8 form.

Thank you for this. I knew about the was-UTF8 status but didn’t know why it exists.
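
The Perl-visible half of that is easy to sketch (names are illustrative): an upgraded and a downgraded key with the same code points address the same hash slot. The last line prints whether the key that comes back from keys() carries the flag, which is where the was-UTF8 bookkeeping would show up; I've left the expected value out rather than assert it.

```perl
use strict;
use warnings;

my $down = "\xe9";
my $up   = "\xe9";
utf8::upgrade($up);

my %h;
$h{$up} = "stored via the upgraded key";

my $found = exists $h{$down} ? 1 : 0;   # lookup via the downgraded form
$h{$down} = "overwritten via the downgraded key";

my $count = keys %h;
my ($key) = keys %h;

printf "found=%d count=%d\n", $found, $count;
printf "returned key carries the flag: %s\n", utf8::is_utf8($key) ? "yes" : "no";
```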

> I think you need to step back and consider that strings are sequences of octets. Sometimes those octets are ordered such that they can be interpreted as utf8. The UTF-8 flag being on tells perl that it can and should treat the octets as utf8. 

C strings are sequences of octets, yes. Perl strings, though, are sequences of code points, not octets. In this they’re more like JavaScript strings than C strings.
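
A minimal illustration (variable names are mine): length() and ord() report code points, and octets only enter the picture once you explicitly encode.

```perl
use strict;
use warnings;

my $smiley  = chr(0x263A);    # one code point above 0xFF
my $encoded = $smiley;
utf8::encode($encoded);       # now a string of UTF-8 octets

printf "code points: %d (first is U+%04X)\n", length($smiley), ord($smiley);
printf "octets after encoding: %d\n", length($encoded);
```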

>   my $foo = "é";
> 
> I don't know exactly what that code does without doing an octet level investigation of the data. It could be one octet and in latin-1 or it could be two octets and be Unicode in one of several formats (utf8, utf-16BE, utf-16LE) and still be rendered identically in an editor or browser.

Sorry, I assumed we all use UTF-8 terminals. :) But yes, I should have written it as two \x escapes, sorry.

> I also know what happens here:
> 
> my $foo="\x{c3}\x{a9}";
> utf8::decode($foo);
> Dump($foo);
> 
> SV = PV(0x2303fc0) at 0x2324c98
>   REFCNT = 1
>   FLAGS = (POK,IsCOW,pPOK,UTF8)
>   PV = 0x2328300 "\303\251"\0 [UTF8 "\x{e9}"]
>   CUR = 2
>   LEN = 10
>   COW_REFCNT = 1
> 
> That is, I start off with two octets, C3 - A9, which happens to be the encoding for the codepoint E9, which happens to be é.
> I then tell perl to "decode" those octets, which really means I tell perl to check that the octets actually do make up valid utf8. And if perl agrees that indeed these are valid utf8 octets, then it turns the flag on. Now it doesn't matter if you *meant* to construct utf8 in the variable fed to decode, all that matters is that at an octet level those octets happen to make up valid utf8.

I think you’re actually breaking the abstraction here by assuming that Perl implements the decode by setting a flag.

It would be just as legitimate to mutate the PV to store a single octet, 0xe9, and leave the UTF8 flag off. Perl doesn’t do that, of course, because it’s easier just to set a flag, but as long as the string content is the single code point 0xe9 it doesn’t really matter how Perl achieves that.
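
As a sketch of that alternative, one can get the "single octet, flag off" representation by hand with utf8::downgrade() after the decode. The variable names are illustrative; the flag state after decode matches the Dump() output quoted above:

```perl
use strict;
use warnings;

my $s = "\xc3\xa9";
utf8::decode($s) or die "not well-formed UTF-8";
my $flag_after_decode = utf8::is_utf8($s) ? 1 : 0;
my $decoded_copy      = $s;

utf8::downgrade($s);          # store the same code point as one octet
my $flag_after_downgrade = utf8::is_utf8($s) ? 1 : 0;

printf "flag after decode=%d, after downgrade=%d\n",
    $flag_after_decode, $flag_after_downgrade;
printf "still equal: %s (length %d, code point %X)\n",
    ($s eq $decoded_copy ? "yes" : "no"), length($s), ord($s);
```

Either representation is the same Perl string; which one the interpreter picks is an implementation detail.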

(Notwithstanding, of course, the abstraction leaks that things like the unicode_strings feature and Sys::Binmode fix.)

There are parts of the code that appear to go the other way and prioritize downgraded storage. Perl_refcounted_he_fetch_pvn(), for example.

-FG