develooper Front page | perl.perl5.porters | Postings from September 2021

Re: Pre-RFC: Rename SVf_UTF8 et al.

Thread Previous | Thread Next
From:
demerphq
Date:
September 2, 2021 14:02
Subject:
Re: Pre-RFC: Rename SVf_UTF8 et al.
Message ID:
CANgJU+UrzLCQEjsgTgr4eeuWmJCHo8-0HmDoh-9FKEP6ZncS6Q@mail.gmail.com
On Thu, 2 Sept 2021 at 15:20, demerphq <demerphq@gmail.com> wrote:

> On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com>
> wrote:
>
>>
>> > On Aug 20, 2021, at 1:05 PM, demerphq <demerphq@gmail.com> wrote:
>> >
>> > On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com>
>> wrote:
>> > Per recent IRC discussion …
>> >
>> > PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
>> confusion regarding the flag’s significance. Some think it indicates
>> whether a given PV stores text versus binary. Some think it means that the
>> PV is valid UTF-8. Still others likely hold other inaccurate views.
>> >
>> > The problem here is the naming. For example, consider `perl -e'my $foo
>> = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that
>> its code points (assuming use of a UTF-8 terminal) correspond to the bytes
>> that encode “é” in UTF-8.
>> >
>> > Nope. It might contain utf8, but it is not UTF8-ON. Think of it like a
>> square/rectangle relationship. All strings are "rectangles", all "squares"
>> are rectangles, some strings are squares, but unless SQUARE flag is ON perl
>> should assume it is a rectangle, not a square. The SQUARE flag should only
>> be set when the rectangle has been proved conclusively to be a square. That
>> the SQUARE flag is off does not mean the rectangle is not a square, merely
>> that the square has not been proved to be such.
>>
>> You’re defining “a UTF-8 string” as “a string whose PV is marked as
>> UTF-8”. I’m defining it as “a string whose Perl-visible code points happen
>> to be valid UTF-8”.
>>
>
> I dont find your definition to be very useful, nor descriptive of how perl
> manages these matters, so I am not using it. You are confusing different
> levels of abstraction. Your definition also would include cases where the
> data is already encoded and flagged as utf8. So it doesn't make sense to me.
>
> Here is the set of definitions that I am operating from:
>
> A "string" is a programming concept inside of Perl which is used to
> represent "text" buffers of memory. There are three level of abstraction
> for strings, two of which are tightly coupled. The three are the codepoint
> level, semantic level and encoding level.
>
> At the codepoint levels you can think of strings as variable length arrays
> of numbers (codepoints), where the numbers are restricted to 0 to 0x10FFFF.
>
> At the semantics level you can think of these numbers (codepoints) of
> representing characters from some form of text with specific rules for
> certain operations like case-folding, as well as a well defined mapping to
> graphemes which are displayed to our eyes when those numbers are rendered
> by a display device like a terminal.
>
> The encoding level of abstraction addresses how those numbers (codepoints)
> will be represented as bytes (octets) in memory inside of Perl, and when
> you directly write the data to disk or to some other output stream.
>
> There are two sets of codepoint range, semantics and encoding available,
> which are controlled by a flag associated with the string called the UTF8
> flag. When set this flag indicates that the string can represent codepoints
> 0 to 0x10FFFF, should have Unicode semantics applied to it, and that its in
> memory representation is variable-width utf8. When the flag is not set it
> indicates the string can represent codepoints 0 to 255, has ASCII
> case-folding semantics, and that its in memory representation is fixed
> width octets.
>
> In order to be able to combine these two types of strings we need to
> define some operations:
>
> upgrading/downgrading: converting a string from one set of semantics and
> encoding to the other while preserving exactly the codepoint level
> representation. By tradition we call it upgrading when we go from Latin-1
> to Unicode with the result being UTF8  on, and we call it downgrading when
> we go from Unicode to Latin1 with the result being UTF8-off. These
> operations are NOT symmetrical. It is *not* possible to downgrade every
> Unicode string to Latin-1, however it is possible to upgrade every Latin-1
> string to Unicode.  By tradition upgrade and downgrade functions are noops
> when their input is already in the form expected as the result, but this is
> by tradition only.
>
> decoding/encoding: converting a string from one form to the other in a way
> that transforms the codepoints from one form to a potentially different
> form. Traditional we speak of decode_utf8() taking a latin1 string
> containing octets that make up a utf8 encoded string, and returning a
> string which is UTF8 on which represents the Unicode version of those
> octets. For well formed input this results in no change to the underlying
> string, but the flag is flipped on. Vice versa we speak of encode_utf8()
> which converts its input to a utf8 encoded form, regardless of what form it
> was represented internally.
>
> When we are confronted with combining the two forms of string Perl has
> little choice but to use the "safe" strategy of "upgrading" the Latin-1
> parts to Unicode.
>
> Both the operations of "upgrading" and "decoding" result in Utf8-on
> strings, and indeed both can result in not changing their input at all, but
> when they do change their input they change it very differently. Most of
> the places people get into trouble with strings is when they end up doing
> upgrade operations when they should have done a decode operation. This is
> because upgrade operations can happen implicitly based on simple rules and
> thus can happen "by accident", but decode operations are always explicit so
> they never happen without the involvement of the developer in some way.
> This is at least partly because upgrade operations do not have any failure
> modes but decode operations do.
>
> Most of the time, as long as you are only thinking about codepoints,
> developers dont have to worry about this stuff. The places where they do
> are when they are reading or writing data, and in some cases when they are
> embedding string constants in their code where they want a particular set
> of semantics and encoding. As long as people are disciplined to use
> decode_utf8() before they use utf8 string data, and encode_utf8  before
> they emit it then the complexities above should be transparent to the
> developer.
>


I was rereading this and I thought of something to add here. Part of the
confusion with Perl strings is that we try to hide the flag. We dont really
want people to look at it and think about it. Instead we  provide a handful
of verbs which can be used to force the string to the shape we want, or
throw an error if we cant  (or sometimes be a no-op).

I mean, if I want to be sure i have a latin-1 string then i would do
something like:

eval { utf8::downgrade($str); 1 } or warn "Cant downgrade string!";

And if want to be user I have a utf8 string then I would do something like:

utf8::upgrade($str);

I wonder if we made accessing the flag state more socially acceptable
whether people would find this less confusing.

Yves

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About