[Encode] HEADS-UP; $Encode::VERSION++ to enhance filter option

From: Dan Kogai
Date: January 25, 2003 23:52
Subject: [Encode] HEADS-UP; $Encode::VERSION++ to enhance filter option
Message ID: FA81C454-3102-11D7-B3C0-000393AE4244@dan.co.jp

Porters,

   In a recent discussion on various Perl-related mailing lists in 
Japanese, I discovered that the encoding pragma does not work with 
multibyte encodings such as Shift_JIS, whose 2nd byte can fall in the 
0x00-0x7f range.  Though I have not tested it, I am pretty sure Big5 is 
also prone to this.

   To understand this problem, please have a look at the hexdump below:

> % hexdump -C enc-sjis.pl
> 00000000  23 2f 75 73 72 2f 6c 6f  63 61 6c 2f 62 69 6e 2f  |#/usr/local/bin/|
> 00000010  70 65 72 6c 20 2d 77 0a  75 73 65 20 73 74 72 69  |perl -w.use stri|
> 00000020  63 74 3b 0a 75 73 65 20  65 6e 63 6f 64 69 6e 67  |ct;.use encoding|
> 00000030  20 27 73 68 69 66 74 2d  6a 69 73 27 3b 0a 0a 6d  | 'shift-jis';..m|
> 00000040  79 20 24 6e 61 6d 65 20  3d 20 22 94 5c 22 3b 0a  |y $name = ".\";.|
> 00000050  70 72 69 6e 74 20 24 6e  61 6d 65 3b 0a 77 72 69  |print $name;.wri|
> 00000060  74 65 3b 0a 0a 66 6f 72  6d 61 74 20 53 54 44 4f  |te;..format STDO|
> 00000070  55 54 20 3d 0a 94 5c 97  cd 3a 40 3c 3c 3c 0a 24  |UT =..\..:@<<<.$|
> 00000080  6e 61 6d 65 0a 2e 0a                              |name...|

   The script is valid perl when read as Shift_JIS, but the quoted 
character (U+80fd, \x94\x5c in Shift_JIS) has \x5c, the ASCII 
backslash, as its 2nd byte.  Perl therefore sees the closing quote as 
escaped, which mangles the script.  The encoding pragma needs the 
source to be parsable ASCII-wise.
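
A minimal sketch of the byte-level problem, assuming a perl with the 
Encode module installed.  It takes the two bytes from the hexdump, 
looks at the 2nd byte the way an ASCII-wise parser would, and then 
decodes the pair as Shift_JIS.  The variable names are illustrative 
only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two bytes that encode the quoted character in the hexdump.
my $bytes = "\x94\x5c";

# Read byte by byte, the 2nd byte is the ASCII backslash.
my $second = substr($bytes, 1, 1);
printf "second byte: 0x%02x (%s)\n",
    ord($second), $second eq '\\' ? 'backslash' : 'other';

# Decoded as Shift_JIS, the same two bytes form a single character.
my $char = decode('shiftjis', $bytes);
printf "characters: %d, code point: U+%04X\n", length($char), ord($char);
```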
   Fortunately, the encoding pragma offers a different approach via 
Filter=>1.  The problem is that the Filter option is incomplete in two 
ways.

0.  Filter=>1 leaves STD(IN|OUT) untouched.  Not only does it leave 
STD* untouched, it also completely ignores the STD*=> hooks that the 
non-filter version offers.

1.  To handle STD(IN|OUT) sensibly you have to 'use utf8' in the script 
to make sure the literals therein are utf8-flagged, but that makes the 
code too counterintuitive.
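
To illustrate why the filter approach sidesteps the \x5c problem, here 
is a rough sketch of transcoding one source line from Shift_JIS to 
UTF-8 before it is parsed.  This is not the pragma's actual filter 
code; sjis_line_to_utf8 is a hypothetical helper, and the sketch 
assumes the Encode module is available.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn one line of Shift_JIS source into UTF-8 bytes,
# roughly the kind of conversion a source filter could apply per line.
sub sjis_line_to_utf8 {
    my ($line) = @_;
    return encode('UTF-8', decode('shiftjis', $line));
}

# The problematic line from the hexdump, built from explicit byte escapes.
my $sjis_line = qq{my \$name = "\x94\x5c";\n};
my $utf8_line = sjis_line_to_utf8($sjis_line);

# Count backslash bytes before and after transcoding.
my $before = () = $sjis_line =~ /\x5c/g;
my $after  = () = $utf8_line =~ /\x5c/g;
print "backslash bytes before: $before, after: $after\n";
```

After transcoding, the character no longer contains a byte that an 
ASCII-wise parser could mistake for an escape.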

The following patch fixes both problems so that the filter option is 
more useful.  I am planning to apply this patch to the next version of 
Encode, but I still need to fix the POD and write test suites.  So I 
decided to issue a warning before committing a release.

Dan the Encode Maintainer

--- encoding.pm 2003/01/22 03:29:07     1.40
+++ encoding.pm 2003/01/26 07:03:59
@@ -35,33 +35,11 @@
      unless ($arg{Filter}) {
         ${^ENCODING} = $enc unless $] <= 5.008 and $utfs{$name};
         $HAS_PERLIO or return 1;
-       for my $h (qw(STDIN STDOUT)){
-           if ($arg{$h}){
-               unless (defined find_encoding($arg{$h})) {
-                   require Carp;
-                   Carp::croak("Unknown encoding for $h, '$arg{$h}'");
-               }
-               eval { binmode($h, ":encoding($arg{$h})") };
-           }else{
-               unless (exists $arg{$h}){
-                   eval {
-                       no warnings 'uninitialized';
-                       binmode($h, ":encoding($name)");
-                   };
-               }
-           }
-           if ($@){
-               require Carp;
-               Carp::croak($@);
-           }
-       }
      }else{
         defined(${^ENCODING}) and undef ${^ENCODING};
         eval {
             require Filter::Util::Call ;
             Filter::Util::Call->import ;
-           binmode(STDIN);
-           binmode(STDOUT);
             filter_add(sub{
                            my $status;
                             if (($status = filter_read()) > 0){
@@ -71,7 +49,31 @@
                            $status ;
                        });
         };
+       # internally use utf8 to make sure utf8 flags are set
+       # for literals.
+       use utf8 (); # to fetch $utf8::hint_bits;
+       $^H |= $utf8::hint_bits;
         # warn "Filter installed";
+    }
+    for my $h (qw(STDIN STDOUT)){
+       if ($arg{$h}){
+           unless (defined find_encoding($arg{$h})) {
+               require Carp;
+               Carp::croak("Unknown encoding for $h, '$arg{$h}'");
+           }
+           eval { binmode($h, ":encoding($arg{$h})") };
+       }else{
+           unless (exists $arg{$h}){
+               eval {
+                   no warnings 'uninitialized';
+                   binmode($h, ":encoding($name)");
+               };
+           }
+       }
+       if ($@){
+           require Carp;
+           Carp::croak($@);
+       }
      }
      return 1; # I doubt if we need it, though
  }
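
The effect of the binmode($h, ":encoding($name)") call that the patch 
moves out of the non-filter branch can be sketched in isolation.  The 
example below is an assumption-laden illustration, not part of the 
patch: it uses an in-memory filehandle in place of STDOUT so that the 
output bytes can be inspected.

```perl
use strict;
use warnings;

# In-memory handle standing in for STDOUT (an assumption for this sketch).
my $buf = '';
open(my $out, '>', \$buf) or die "cannot open in-memory handle: $!";

# Same call shape as in the patch: binmode($h, ":encoding($name)").
my $name = 'shiftjis';
binmode($out, ":encoding($name)") or die "binmode failed: $!";

# Print one utf8-flagged character, U+80FD.
print {$out} "\x{80fd}";
close($out) or die "close failed: $!";

# Show the bytes that the encoding layer actually wrote.
my $hex = join ' ', map { sprintf '%02x', ord } split //, $buf;
print "$hex\n";
```

With the layer in place, a utf8-flagged literal is written out as the 
original Shift_JIS byte pair.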

