On Thu, Sep 2, 2021 at 9:03 PM Yuki Kimoto <kimoto.yuki@gmail.com> wrote:

> I want to get the basic knowledge to join this discussion.
>
> Would you tell me the following things?
>
> 1. Do the following things mean the same or different?
>
> my $bytes = Encode::encode('UTF-8', $string);
>
> utf8::encode($string);
> my $bytes = $string;

Similar, with some implementation differences. Encode::encode doesn't modify $string in place (with those arguments), and utf8::encode does. Encode::encode with UTF-8 will encode invalid codepoints (such as surrogates or codepoints above U+10FFFF) to replacement characters (with those arguments), whereas utf8::encode will naively encode them with Perl's internal encoding method like any other codepoint, which can result in bytestrings that UTF-8 decoders may consider invalid.

> 2. Do the following things mean the same or different?
>
> my $string = Encode::decode('UTF-8', $bytes);
>
> utf8::decode($bytes);
> my $string = $bytes;

Similar to the above, but additionally, if the bytes cannot be interpreted even as Perl's lax internal encoding, utf8::decode will return false and leave the string unchanged, whereas Encode::decode decodes malformed byte sequences to replacement characters (with those arguments). Encode::decode will also decode invalid codepoints to replacement characters, but utf8::decode will naively accept them.

> 3. Do the following things mean the same or different?
>
> # Perl
> utf8::decode
>
> # XS
> sv_utf8_decode

These are the same.

> 4. Do the following things mean the same or different?
>
> # Perl
> utf8::encode
>
> # XS
> sv_utf8_encode

These are the same.

Overall, all of these change the logical contents of the string from bytes to the Unicode characters they represent, or from Unicode characters to the bytes that represent them.

-Dan
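
To make the differences in answers 1 and 2 concrete, here is a minimal sketch. The sample inputs (a smiley, a lone surrogate, a stray 0xFF byte) and all variable names are illustrative assumptions, not something taken from the thread, and the behaviour shown is for the default arguments only.

```perl
use strict;
use warnings;
no warnings 'utf8';    # a lone surrogate is handled on purpose below
use Encode ();

# 1a. Valid character: both routes yield the same UTF-8 bytes.
my $string     = "\x{263A}";
my $via_encode = Encode::encode('UTF-8', $string);   # $string is not touched
my $via_utf8   = $string;                            # work on a copy, because
utf8::encode($via_utf8);                             # utf8::encode is in place

# 1b. Invalid codepoint (surrogate): the two routes diverge.
my $surrogate = "\x{D800}";
my $strict    = Encode::encode('UTF-8', $surrogate); # U+FFFD, encoded
my $lax       = $surrogate;
utf8::encode($lax);                                  # surrogate encoded naively

# 2. Malformed bytes: Encode substitutes, utf8::decode refuses.
my $decoded  = Encode::decode('UTF-8', "\xFF");      # replacement character
my $in_place = "\xFF";
my $ok       = utf8::decode($in_place);              # false, $in_place unchanged

printf "valid:     %s vs %s\n", unpack('H*', $via_encode), unpack('H*', $via_utf8);
printf "surrogate: %s vs %s\n", unpack('H*', $strict),     unpack('H*', $lax);
printf "malformed: U+%04X vs ok=%d, bytes=%s\n",
    ord($decoded), ($ok ? 1 : 0), unpack('H*', $in_place);
```

Copying before calling utf8::encode or utf8::decode is the usual way to get the non-destructive behaviour of the Encode functions, and checking the return value of utf8::decode is the only way to learn that the input was not decodable.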