> On Mar 30, 2021, at 3:20 AM, Dan Book <grinnz@gmail.com> wrote:
>
> On Tue, Mar 30, 2021 at 3:07 AM Yuki Kimoto <kimoto.yuki@gmail.com> wrote:
> I still don't understand this problem.
>
> Is 128-255(latin-1 and UTF-8 shared) range problem?
>
> More specifically the problem is that those codepoints are both used by encodings such as latin-1 and UTF-8 bytes, and also are valid unicode codepoints (which are mostly the same as the characters they encode as latin-1 bytes). This, with Perl's string design, means the program cannot know whether a string intends to contain decoded text or encoded bytes.

To flesh this out a bit: Consider the following XSUB:

    void
    printit (char * thestring)
        CODE:
            fprintf(stdout, "%s", thestring);

Now consider the following Perl:

    my $str_a = "\xff";
    printit($str_a);

It’ll print one byte, right? Now consider this Perl:

    my $str_b = "\xff\x{100}";
    chop $str_b;
    printit($str_b);

That will print TWO bytes, even though $str_b eq $str_a.

The problem is that `char *` in XS’s default typemap uses SvPV. Ordinarily this is fine as long as you check the SV’s SvUTF8 flag, but the typemap doesn’t do that, so our C code has no way to know how many actual characters the original Perl string contains. In this case, $str_b’s PV is UTF-8, while $str_a’s is Latin-1, which means that, although both Perl strings contain only code point 255, SvPV will yield different C strings for them.

The fix here is to switch the typemap to SvPVbyte so that identical Perl strings will yield identical C representations.

That was a somewhat contrived example; a more realistic one might be:

    my $str_c = Encode::Simple::decode_utf8('ÆÆÆ');
    printit($str_c);

Here, printit() does something people might realistically expect: its use of the internal PV effects a de facto “auto-UTF-8-encode”. If the default typemap were SvPVbyte here, printit() would throw an exception, thus breaking the application.
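
To make the one-byte/two-byte difference between $str_a and $str_b concrete, here is a rough sketch in plain C of the two ways code point 255 can end up stored. The helper names encode_latin1 and encode_utf8_low are made up for illustration; they are not part of the Perl API, and they only handle code points 0..255, which is all this example needs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not Perl API): store a code point in 0..255
 * as a single Latin-1 byte. Returns the number of bytes written. */
size_t encode_latin1(unsigned int cp, unsigned char *out)
{
    assert(cp <= 0xFF);
    out[0] = (unsigned char)cp;
    return 1;
}

/* Hypothetical helper (not Perl API): store a code point in 0..255
 * as UTF-8. Code points below 0x80 take one byte; 0x80..0xFF take
 * two. Returns the number of bytes written. */
size_t encode_utf8_low(unsigned int cp, unsigned char *out)
{
    assert(cp <= 0xFF);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (cp >> 6));   /* lead byte */
    out[1] = (unsigned char)(0x80 | (cp & 0x3F)); /* continuation byte */
    return 2;
}
```

For code point 255 the first helper writes the single byte 0xFF and the second writes the two bytes 0xC3 0xBF. That is the same difference a `char *` argument sees depending on whether the SV's PV happens to be stored as Latin-1 or as UTF-8.
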
The Perl application *should* encode_utf8() for itself, but lots of Perl code out there probably depends on this auto-encode behaviour.

The same bug exists if you mkdir($str_a) and mkdir($str_b). See Sys::Binmode for a CPAN fix. I hope to convince enough folks here to bring that into core eventually.

-F