perl.perl5.porters | Postings from September 2021

Re: Pre-RFC: Rename SVf_UTF8 et al.

From: Dan Book
Date: September 3, 2021 01:30
Subject: Re: Pre-RFC: Rename SVf_UTF8 et al.
Message ID: CABMkAVXc1kOOLkMUO5NocfxapOmQxYEAcwAbEHrEfSc0Te=f3w@mail.gmail.com

On Thu, Sep 2, 2021 at 9:03 PM Yuki Kimoto <kimoto.yuki@gmail.com> wrote:

> I want to get the basic knowledge to join this discussion.
>
> Would you tell me the following things?
>
> 1. Do the following things mean the same or different?
>
>   my $bytes = Encode::encode('UTF-8', $string);
>
>   utf8::encode($string);
>   my $bytes = $string;
>

Similar, with some implementation differences. Encode::encode doesn't
modify $string in place (with those arguments), whereas utf8::encode does.
Encode::encode with UTF-8 will encode invalid code points (such as
surrogates and code points above U+10FFFF) to replacement characters (with
those arguments), whereas utf8::encode will naively encode them with Perl's
internal encoding method like any other code point (which can result in
byte strings that strict UTF-8 decoders may consider invalid).
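
To make the two differences concrete, here is a minimal sketch, assuming a
reasonably recent Perl with the core Encode module. The hex_bytes helper and
all variable names are made up for illustration, and the sample strings are
not from the original question.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical display helper: render each byte of a string as two hex digits.
sub hex_bytes { return join ' ', map { sprintf '%02X', ord } split //, $_[0] }

my $string = "caf\x{E9}";    # four characters, the last one is U+00E9

# Encode::encode returns the encoded bytes; with these arguments the
# argument itself is left as it was.
my $from_encode = Encode::encode('UTF-8', $string);

# utf8::encode rewrites its argument in place, so work on a copy.
my $from_utf8 = $string;
utf8::encode($from_utf8);

# A lone surrogate is where the two diverge.
my $surrogate = "\x{D800}";
my $strict    = Encode::encode('UTF-8', $surrogate);   # replacement character
my $lax       = $surrogate;
utf8::encode($lax);                                    # encoded naively

printf "string length after Encode::encode: %d\n", length $string;
printf "Encode::encode: %s\n", hex_bytes($from_encode);
printf "utf8::encode:   %s\n", hex_bytes($from_utf8);
printf "surrogate via Encode::encode: %s\n", hex_bytes($strict);
printf "surrogate via utf8::encode:   %s\n", hex_bytes($lax);
```

For ordinary text the two byte strings should be identical; only the
surrogate case is expected to differ.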


> 2. Do the following things mean the same or different?
>
>   my $string = Encode::decode('UTF-8', $bytes);
>
>   utf8::decode($bytes);
>   my $string = $bytes;
>

Similar to the above, but additionally, if the bytes cannot be interpreted
even as Perl's lax internal encoding, utf8::decode will return false and
leave the string unchanged, whereas Encode::decode decodes malformed byte
sequences to replacement characters (with those arguments). Encode::decode
will also decode invalid code points to replacement characters, but
utf8::decode will naively accept them.
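
A small sketch of both decode cases, again assuming a recent Perl with the
core Encode module. The code_points helper, the variable names and the
sample byte strings are illustrative assumptions only.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical display helper: list the code points of a string.
sub code_points { return join ' ', map { sprintf 'U+%04X', ord } split //, $_[0] }

# Case 1: bytes that are not well-formed UTF-8 (a trailing Latin-1 0xE9).
my $malformed = "caf\xE9";

my $replaced = Encode::decode('UTF-8', $malformed);   # bad byte is replaced

my $kept = $malformed;
my $ok   = utf8::decode($kept);                       # false, $kept untouched

# Case 2: bytes that are well-formed in Perl's lax sense but encode a
# surrogate (U+D800), which strict UTF-8 does not allow.
my $surrogate_bytes = "\xED\xA0\x80";

my $strict = Encode::decode('UTF-8', $surrogate_bytes);

my $lax    = $surrogate_bytes;
my $lax_ok = utf8::decode($lax);                      # true, accepted as-is

printf "Encode::decode on malformed: %s\n", code_points($replaced);
printf "utf8::decode on malformed:   %s (returned %s)\n",
    code_points($kept), $ok ? 'true' : 'false';
printf "Encode::decode on surrogate: %s\n", code_points($strict);
printf "utf8::decode on surrogate:   %s (returned %s)\n",
    code_points($lax), $lax_ok ? 'true' : 'false';
```

How many replacement characters Encode::decode produces for the surrogate
bytes may depend on the Encode version, so the sketch only prints them
rather than assuming a count.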


> 3. Do the following things mean the same or different?
>
>   # Perl
>   utf8::decode
>
>   # XS
>   sv_utf8_decode
>

These are the same.

> 4. Do the following things mean the same or different?
>
>   # Perl
>   utf8::encode
>
>   # XS
>   sv_utf8_encode
>

These are the same.

Overall, all of these change the logical contents of the string from bytes
to the Unicode characters they represent, or from Unicode characters to
representative bytes.
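
As a rough illustration of what "logical contents" means here, the sketch
below (variable names are illustrative only) shows that the same text has
different lengths before and after encoding, and that decoding restores the
original characters.

```perl
use strict;
use warnings;

my $chars = "\x{263A}";      # one character: WHITE SMILING FACE

my $bytes = $chars;
utf8::encode($bytes);        # now the three bytes E2 98 BA

my $back = $bytes;
utf8::decode($back);         # the single character again

printf "characters: %d, bytes: %d, round trip ok: %s\n",
    length $chars, length $bytes, ($back eq $chars ? 'yes' : 'no');
```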

-Dan
