perl.perl5.porters | Postings from August 2021

Re: Pre-RFC: Rename SVf_UTF8 et al.

From: demerphq
Date: August 20, 2021 17:06
Subject: Re: Pre-RFC: Rename SVf_UTF8 et al.
Message ID: CANgJU+U4pwpCRbzfVg+=nPXx+ThVFbaWbn0uOeVGRCN454mpgg@mail.gmail.com

On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com> wrote:

> Per recent IRC discussion …
>
> PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
> confusion regarding the flag’s significance. Some think it indicates
> whether a given PV stores text versus binary. Some think it means that the
> PV is valid UTF-8. Still others likely hold other inaccurate views.
>
> The problem here is the naming. For example, consider `perl -e'my $foo =
> "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its
> code points (assuming use of a UTF-8 terminal) correspond to the bytes that
> encode “é” in UTF-8.


Nope. It might contain utf8, but it is not UTF8-ON. Think of it like a
square/rectangle relationship. All strings are "rectangles", all "squares"
are rectangles, some strings are squares, but unless the SQUARE flag is ON
perl should assume it is a rectangle, not a square. The SQUARE flag should
only be set when the rectangle has been proved conclusively to be a square.
That the SQUARE flag is off does not mean the rectangle is not a square,
merely that the rectangle has not been proved to be one.


> The “UTF-8 flag”, however, is likely *not* set on this string. By contrast,
> consider `perl -Mutf8 -e'my $foo = "é"'`. Here $foo has the “UTF-8 flag”
> set, but $foo is NOT a “UTF-8 string” because its code points (in this
> case, only 1) aren’t valid UTF-8.
>

Except it is valid UTF-8 (at least in my utf8 terminal):

$ perl -MDevel::Peek -Mutf8 -e'my $foo = "é"; Dump($foo)'
SV = PV(0x153efc0) at 0x155fb38
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK,UTF8)
  PV = 0x1563240 "\303\251"\0 [UTF8 "\x{e9}"]
  CUR = 2
  LEN = 10
  COW_REFCNT = 1

So the string is UTF-8.

Without XS tricks you cannot get the UTF-8 flag turned on while the
buffer contains non-utf8. It is that simple. (Sure you can do it with
Encode::_utf8_on() but that is XS.)
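
To make the "XS trick" concrete, here is a small sketch (mine, not part of
the original exchange) that forces the flag on over a buffer that is not
valid utf8. It assumes the core Encode module is available and uses
utf8::valid(), which is documented as an internal consistency check:

```perl
use strict;
use warnings;
use Encode ();

# A lone \xC3 is a utf8 lead byte with no continuation byte, so this
# one-octet buffer is not valid utf8.
my $bytes = "\xc3";

# Encode::_utf8_on() sets the flag without looking at the buffer.
Encode::_utf8_on($bytes);

my $flag_on    = utf8::is_utf8($bytes) ? 1 : 0;  # is the flag set?
my $consistent = utf8::valid($bytes)   ? 1 : 0;  # does the buffer match the flag?

print "flag on: $flag_on, consistent: $consistent\n";
```

The flag ends up on while the consistency check fails, which is the state
that ordinary Perl code should never be able to produce.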

I do not understand your point that only the initiated can understand this
flag. It means one and only one thing: that the perl internals should
assume that the buffer contains utf8 encoded data, that perl should
apply unicode semantics when doing character and case-insensitive
operations, and that perl can make certain assumptions when it is
processing the data (eg that it is not malformed).

When it is off it does not mean that the data cannot be utf8 data, merely
that Perl cannot and should not assume it is utf8 data, should not try
to interpret it as utf8 data when the string is used in character
operations, and when it is used in case-insensitive operations should
use the traditional, limited case-insensitive logic from ASCII.
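
A rough illustration of the difference (my own sketch, not from the
original post): the same codepoint is held once with the flag off and once
with it on. It assumes the unicode_strings feature and locale are not in
effect, since either changes which rules apply to flag-off strings:

```perl
use strict;
use warnings;
# Make sure flag-off strings get the traditional rules even if the
# surrounding environment enabled this feature.
no feature 'unicode_strings';

my $off = chr(0xe9);     # é held as one octet, UTF-8 flag off
my $on  = chr(0xe9);
utf8::upgrade($on);      # same codepoint held as utf8, flag on

my $equal  = ($off eq $on) ? 1 : 0;        # eq compares codepoints
my $uc_off = sprintf "%02X", ord uc $off;  # case-mapped with ASCII-only rules
my $uc_on  = sprintf "%02X", ord uc $on;   # case-mapped with unicode rules
my $w_off  = ($off =~ /\w/) ? 1 : 0;       # word character under ASCII rules?
my $w_on   = ($on  =~ /\w/) ? 1 : 0;       # word character under unicode rules?

print "equal=$equal uc(off)=$uc_off uc(on)=$uc_on \\w(off)=$w_off \\w(on)=$w_on\n";
```

The two strings compare equal, yet only the flag-on copy is upper-cased
to U+00C9 and matched by \w.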

Personally I think renaming this flag will just increase confusion, not
decrease.

BTW, your scheme needs to account for WAS_UTF8 as well. Most people don't
know it, but there are actually three types of strings in the perl
internals: UTF8-ON, UTF8-OFF, and UTF8-OFF + WAS_UTF8. The third only
manifests in hash keys, but it needs to be accounted for as well in any
renaming. Perl dictates that keys which are character-wise equivalent hash
the same regardless of the UTF8 flag (or put alternatively, the hash should
be of the codepoints the string represents, NOT the octets that make up that
representation). This means perl always tries to downgrade UTF8-ON keys on
lookup or store in a hash. If the downgrade is successful the key is marked
as WAS_UTF8 and the downgraded string is stored and hashed. If it was
unsuccessful (eg it contains codepoints above 255) the key is marked as
UTF8-ON and the original buffer is hashed. When the key is extracted with
keys() or each(), if the WAS_UTF8 flag is set the string is upgraded back to
the UTF8 form.
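
The visible effect of this can be sketched from plain Perl (my example,
not from the original post; WAS_UTF8 itself is internal, so the code only
observes its consequences through utf8::is_utf8()):

```perl
use strict;
use warnings;

my $off = chr(0xe9);     # flag off, one octet
my $on  = chr(0xe9);
utf8::upgrade($on);      # flag on, two octets, same codepoint

# Store under the flag-off form, look up under the flag-on form.
my %h = ($off => 'stored with the flag-off key');
my $found = exists $h{$on} ? 1 : 0;

# Storing under the other form must not create a second key.
$h{$on} = 'stored with the flag-on key';
my $count = keys %h;

# A key stored from a flag-on string, then read back with keys().
my %g = ($on => 1);
my ($key) = keys %g;
my $key_same = ($key eq $off) ? 1 : 0;
my $key_flag = utf8::is_utf8($key) ? 1 : 0;  # per the text, restored to the UTF8 form

print "found=$found count=$count key_same=$key_same key_flag=$key_flag\n";
```

Both forms address the same hash entry, and the key read back is
character-wise the same string whichever form was used to store it.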

I think you need to step back and consider that strings are sequences of
octets. Sometimes those octets are ordered such that they can be
interpreted as utf8. The UTF-8 flag being on tells perl that it can and
should treat the octets as utf8.

You used examples that involve literal characters in source code, which I
think might be confusing you, as it introduces weird issues related to what
character set your terminal thinks it is using, what format the text in the
file is stored in, and what operating system is in use.  If you stick to
examples that build the string in code then all of that ambiguity goes away
and it should be easy to understand. Eg when you say:

  my $foo = "é";

I don't know exactly what that code does without doing an octet level
investigation of the data. It could be one octet and in latin-1 or it could
be two octets and be Unicode in one of several formats (utf8, utf-16BE,
utf-16LE) and still be rendered identically in an editor or browser.

However if you say:

my $foo= chr(0xe9); # é

I know exactly what is going on, and what $foo should contain.

I also know what happens here:

my $foo="\x{c3}\x{a9}";
utf8::decode($foo);
Dump($foo);

SV = PV(0x2303fc0) at 0x2324c98
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK,UTF8)
  PV = 0x2328300 "\303\251"\0 [UTF8 "\x{e9}"]
  CUR = 2
  LEN = 10
  COW_REFCNT = 1

That is, I start off with two octets, C3 - A9, which happen to be the
utf8 encoding for the codepoint E9, which happens to be é.
I then tell perl to "decode" those octets, which really means I tell perl
to check that the octets actually do make up valid utf8. And if perl agrees
that indeed these are valid utf8 octets, then it turns the flag on. Now it
doesn't matter if you *meant* to construct utf8 in the variable fed to
decode, all that matters is that at an octet level those octets happen to
make up valid utf8.
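
The same step can be checked without Devel::Peek. This is a sketch of
mine rather than part of the original post, using only the utf8::
functions that are always available:

```perl
use strict;
use warnings;

my $foo = "\xc3\xa9";                      # two octets, flag off
my $len_before = length $foo;              # counted as two characters

my $ok = utf8::decode($foo) ? 1 : 0;       # validate and, if valid, turn the flag on

my $len_after = length $foo;               # now counted as one character
my $flag      = utf8::is_utf8($foo) ? 1 : 0;
my $codepoint = sprintf "U+%04X", ord $foo;

# Pure ASCII is valid utf8 too, but decode leaves the flag off because
# there is nothing multi-byte in the buffer.
my $ascii      = "abc";
my $ascii_ok   = utf8::decode($ascii) ? 1 : 0;
my $ascii_flag = utf8::is_utf8($ascii) ? 1 : 0;

print "ok=$ok length $len_before -> $len_after flag=$flag $codepoint\n";
print "ascii ok=$ascii_ok flag=$ascii_flag\n";
```

The octets in the buffer are the same before and after; what changes is
how perl counts and interprets them.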

Try

my $foo="\x{c3}\x{a9}\x{c3}";
utf8::decode($foo);
Dump($foo);

SV = PV(0x23040a0) at 0x23249f8
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x2329350 "\303\251\303"\0
  CUR = 3
  LEN = 10

So here we can see that perl did nothing with this version of $foo, because
it did not contain a valid utf8 sequence. \x{c3} can never be the last byte
in valid utf8, it always must be followed by something, so perl did not
turn the UTF8 flag on.
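
In practice the return value of utf8::decode() tells you which of the two
cases you hit. A minimal sketch, where try_decode() is a hypothetical
helper of mine and not an existing function:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the decoded string, or undef if the
# octets are not valid utf8. Works on a copy so the caller's data is
# left untouched.
sub try_decode {
    my ($octets) = @_;
    my $copy = $octets;
    return utf8::decode($copy) ? $copy : undef;
}

my $good = try_decode("\xc3\xa9");       # valid: C3 A9
my $bad  = try_decode("\xc3\xa9\xc3");   # invalid: trailing lone C3

printf "good: %s\n", defined $good ? "decoded, length " . length($good) : "rejected";
printf "bad:  %s\n", defined $bad  ? "decoded, length " . length($bad)  : "rejected";
```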

Work the problem like this a while and you will see that this subject is
pretty simple, and that there is a tremendous amount of FUD about it.
The flag says that the buffer contains octets that are valid utf8, and
that perl should apply utf8/unicode semantics when doing "character"
operations on the string. The flag being off means that when doing
character operations it should assume fixed width octet operations, and
it should use ASCII case-folding rules. That is it. The flag being off
does not *ever* mean the data is NOT utf8, it simply means that the data
has not been *validated* as utf8 and thus perl cannot use utf8 rules to
process it.

cheers,
Yves
