Porters,

In recent discussions on various Perl-related mailing lists in Japanese, I discovered a problem: the encoding pragma does not work with multibyte encodings such as Shift_JIS, which use the 0x00-0x7f range in the 2nd byte. Though I have not tested it, I am pretty sure Big5 is also prone to this. To understand the problem, please have a look at the hexdump below:

> % hexdump -C enc-sjis.pl
> 00000000  23 2f 75 73 72 2f 6c 6f  63 61 6c 2f 62 69 6e 2f  |#/usr/local/bin/|
> 00000010  70 65 72 6c 20 2d 77 0a  75 73 65 20 73 74 72 69  |perl -w.use stri|
> 00000020  63 74 3b 0a 75 73 65 20  65 6e 63 6f 64 69 6e 67  |ct;.use encoding|
> 00000030  20 27 73 68 69 66 74 2d  6a 69 73 27 3b 0a 0a 6d  | 'shift-jis';..m|
> 00000040  79 20 24 6e 61 6d 65 20  3d 20 22 94 5c 22 3b 0a  |y $name = ".\";.|
> 00000050  70 72 69 6e 74 20 24 6e  61 6d 65 3b 0a 77 72 69  |print $name;.wri|
> 00000060  74 65 3b 0a 0a 66 6f 72  6d 61 74 20 53 54 44 4f  |te;..format STDO|
> 00000070  55 54 20 3d 0a 94 5c 97  cd 3a 40 3c 3c 3c 0a 24  |UT =..\..:@<<<.$|
> 00000080  6e 61 6d 65 0a 2e 0a                              |name...|

The script is valid Perl when read as Shift_JIS, but the quoted character (U+80fd, \x94\x5c in Shift_JIS) has \x5c as its 2nd byte. perl reads that byte as a backslash, which escapes the closing quote and mangles the script. A script that uses the encoding pragma still needs to be parsable ASCII-wise.

Fortunately, the encoding pragma offers a different approach via Filter=>1. The problem is that the Filter option was incomplete in two ways.

0. Filter=>1 leaves STD(IN|OUT) untouched. Not only does it leave STD* untouched, it completely ignores the STD*=> hooks that the non-filter version offers.
1. In order to touch STD(IN|OUT) sensibly, you have to 'use utf8' in the script to make sure the literals therein are utf8-flagged, but that makes the code too counterintuitive.

The following patch fixes both, so the Filter option is more useful. I am planning to apply this patch to the next version of Encode, but I still need to fix the POD and write test suites, so I decided to issue a warning before committing a release.
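
To make the 2nd-byte problem concrete, here is a minimal sketch, assuming the Encode module and its Japanese encodings are installed. It only inspects the two bytes \x94\x5c from the hexdump; the variable names are illustrative and not taken from the script or from encoding.pm.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw Shift_JIS bytes for U+80fd, as they appear in the hexdump.
my $bytes = "\x94\x5c";

# Seen byte-wise, the 2nd byte is the same octet as an ASCII backslash.
my $trail = substr($bytes, 1, 1);
my $trail_is_backslash = ($trail eq "\\") ? 1 : 0;

# Decoded as Shift_JIS first, the two bytes become a single character.
my $char = decode('shiftjis', $bytes);

printf "2nd byte: 0x%02x\n", ord $trail;
printf "decoded:  U+%04X, length %d\n", ord $char, length $char;
```

This is why decoding the source before perl parses it, which is what a source filter does, avoids the stray backslash, while a purely byte-wise parse does not.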
Dan the Encode Maintainer

--- encoding.pm 2003/01/22 03:29:07     1.40
+++ encoding.pm 2003/01/26 07:03:59
@@ -35,33 +35,11 @@
     unless ($arg{Filter}) {
         ${^ENCODING} = $enc unless $] <= 5.008 and $utfs{$name};
         $HAS_PERLIO or return 1;
-        for my $h (qw(STDIN STDOUT)){
-            if ($arg{$h}){
-                unless (defined find_encoding($arg{$h})) {
-                    require Carp;
-                    Carp::croak("Unknown encoding for $h, '$arg{$h}'");
-                }
-                eval { binmode($h, ":encoding($arg{$h})") };
-            }else{
-                unless (exists $arg{$h}){
-                    eval {
-                        no warnings 'uninitialized';
-                        binmode($h, ":encoding($name)");
-                    };
-                }
-            }
-            if ($@){
-                require Carp;
-                Carp::croak($@);
-            }
-        }
     }else{
         defined(${^ENCODING}) and undef ${^ENCODING};
         eval {
             require Filter::Util::Call ;
             Filter::Util::Call->import ;
-            binmode(STDIN);
-            binmode(STDOUT);
             filter_add(sub{
                            my $status;
                            if (($status = filter_read()) > 0){
@@ -71,7 +49,31 @@
                            $status ;
                        });
         };
+        # internally use utf8 to make sure utf8 flags are set
+        # for literals.
+        use utf8 (); # to fetch $utf8::hint_bits;
+        $^H |= $utf8::hint_bits;
         # warn "Filter installed";
+    }
+    for my $h (qw(STDIN STDOUT)){
+        if ($arg{$h}){
+            unless (defined find_encoding($arg{$h})) {
+                require Carp;
+                Carp::croak("Unknown encoding for $h, '$arg{$h}'");
+            }
+            eval { binmode($h, ":encoding($arg{$h})") };
+        }else{
+            unless (exists $arg{$h}){
+                eval {
+                    no warnings 'uninitialized';
+                    binmode($h, ":encoding($name)");
+                };
+            }
+        }
+        if ($@){
+            require Carp;
+            Carp::croak($@);
+        }
     }
     return 1; # I doubt if we need it, though
 }
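
For readers skimming the patch: the STDIN/STDOUT loop decides, for each handle, which layer to apply. The following is a rough sketch of that decision as a standalone function. `layer_for` is a hypothetical helper, not part of encoding.pm, and it deliberately leaves out the find_encoding check and the binmode call itself.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the PerlIO layer the loop would apply to
# handle $h, or undef when the loop would leave the handle alone.
sub layer_for {
    my ($name, $h, %arg) = @_;
    # An explicit, true per-handle argument wins.
    return ":encoding($arg{$h})" if $arg{$h};
    # No argument at all: fall back to the pragma's own encoding.
    return ":encoding($name)" unless exists $arg{$h};
    # Key present but false (e.g. STDOUT => undef): handle is untouched.
    return;
}

for my $case ([], [STDOUT => 'utf8'], [STDOUT => undef]) {
    my $layer = layer_for('shiftjis', 'STDOUT', @$case);
    my $shown = defined $layer ? $layer : '(left untouched)';
    print "$shown\n";
}
```

With the loop outside the unless/else, the same decision applies whether or not Filter=>1 is given, which is the point of item 0 in the message.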