develooper Front page | perl.perl5.porters | Postings from February 2003

[Patch] parsing under encoding (Re: [Encode] HEADS-UP; $Encode::VERSION++ to enhance filter option)([perl #16823])

Inaba Hiroto
February 1, 2003 04:57
Dan Kogai wrote:

> Porters,
>    In the recent discussion in various perl-related MLs in Japanese, I
> have discovered a problem that the encoding pragma does not work on
> such multibyte encodings as Shift_JIS which uses 0x00-0x7f ranges in
> the 2nd byte.  Though I have not tested it, I am pretty sure big5 is
> also prone to this.

<Skip sample script hex dump>

>    The perl script is a valid perl script in Shift JIS but the quoted
> character (U+80fd, \x94\x5c in Shift_JIS) uses \x5c in the 2nd byte,
> mangling the script.  The encoding pragma needs to be parsable
> ASCII-wise.
>    Fortunately, the encoding pragma offers a different approach via
> Filter=>1. ...
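
The failure described in the quoted message can be sketched in C. This is only an illustration, not code from perl or from the patch: the function names find_term_bytewise and find_term_sjis are hypothetical, and the lead-byte ranges (0x81-0x9F, 0xE0-0xFC) are the usual Shift_JIS ones. The first scanner treats every 0x5c as a backslash escape, the way an ASCII-wise parser would; the second skips the trail byte of each two-byte character first.

```c
#include <stddef.h>

/* Shift_JIS lead bytes (assumed ranges: 0x81-0x9F and 0xE0-0xFC). */
static int sjis_is_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* ASCII-wise scan: any 0x5c escapes the next byte.
 * Returns the index of the first unescaped terminator, or -1. */
static long find_term_bytewise(const unsigned char *s, size_t len,
                               unsigned char term)
{
    size_t i = 0;
    while (i < len) {
        if (s[i] == 0x5c && i + 1 < len) { i += 2; continue; }
        if (s[i] == term) return (long)i;
        i++;
    }
    return -1;
}

/* Encoding-aware scan: a lead byte consumes its trail byte first,
 * so a 0x5c in trail position is never taken as a backslash. */
static long find_term_sjis(const unsigned char *s, size_t len,
                           unsigned char term)
{
    size_t i = 0;
    while (i < len) {
        if (sjis_is_lead(s[i]) && i + 1 < len) { i += 2; continue; }
        if (s[i] == 0x5c && i + 1 < len) { i += 2; continue; }
        if (s[i] == term) return (long)i;
        i++;
    }
    return -1;
}
```

For the bytes 0x94 0x5c '"' ';' (the quoted character followed by the closing quote), the byte-wise scan misses the closing quote because 0x5c appears to escape it, while the encoding-aware scan finds it at index 2.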

The attached patch (for bleadperl @18609) is an attempt to fix this problem
without the Filter=>1 option.

It does:
  - Modify method_decode (Encode/Encode.xs) and do_encode
    (Encode/encengine.c) to take a terminator argument.
  - Add a method cat_decode to the Encoding object, which takes destination,
    offset and terminator as arguments.
    (Implemented packages are: Encode::XS, Encode::utf8 and
  - Add a function sv_cat_decode() to append a decoded UTF-8 string, given an
    offset and a terminator, using the method cat_decode.
  - When scan_str() parses input with PL_encoding, use sv_cat_decode() with
    the offset and the specified terminator.
  - In fact, I started writing this patch in response to
         Subject: Re: [PATCH] [perl #16823] quote-operators don't work with
         Date: Sun, 1 Dec 2002 18:01:51 +0200
         From: Jarkko Hietaniemi
    so parsing under `use utf8' is also changed in scan_str().
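
The idea of decoding from an offset until a terminator shows up in the decoded output can be sketched as follows. This is not the patch's code: cat_decode_sketch and put_utf8 are hypothetical names, the input encoding is assumed to be UTF-16BE restricted to the BMP (no surrogate handling), and the sketch assumes the terminator is included in the appended output.

```c
#include <stddef.h>

/* Write the UTF-8 form of a BMP code point; returns bytes written. */
static size_t put_utf8(unsigned char *dst, unsigned cp)
{
    if (cp < 0x80) {
        dst[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        dst[0] = (unsigned char)(0xC0 | (cp >> 6));
        dst[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    dst[0] = (unsigned char)(0xE0 | (cp >> 12));
    dst[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    dst[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Decode one character at a time from src, starting at *offset, and
 * append its UTF-8 form to dst at *dlen.  Stop once the decoded
 * character equals term.  *offset is left just past the last input
 * consumed.  Returns 1 if the terminator was found, 0 otherwise. */
static int cat_decode_sketch(unsigned char *dst, size_t *dlen, size_t dcap,
                             const unsigned char *src, size_t slen,
                             size_t *offset, unsigned term)
{
    while (*offset + 1 < slen && *dlen + 3 <= dcap) {
        unsigned cp = ((unsigned)src[*offset] << 8) | src[*offset + 1];
        *offset += 2;
        *dlen += put_utf8(dst + *dlen, cp);
        if (cp == term)
            return 1;   /* compared after decoding, not byte-wise */
    }
    return 0;
}
```

Because the comparison happens on decoded characters, a byte 0x22 that is merely half of another character (for example the first byte of U+2200 in UTF-16BE) does not end the scan; only a real U+0022 does.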

Though unrelated to the main intent, the patch also modifies
sv_recode_to_utf8() to:
    - Change !DO_UTF8(sv) to !SvUTF8(sv) && !IN_BYTES
    - Add save_re_context()
    - Retract my useless code which checks UTF8_IS_INVARIANT
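
The effect of the first change can be checked with a small truth-table sketch. It relies on DO_UTF8(sv) being defined in sv.h as (SvUTF8(sv) && !IN_BYTES); the two functions below are hypothetical stand-ins that model only this part of the condition, with plain booleans in place of the real macros.

```c
#include <stdbool.h>

/* Old test: !DO_UTF8(sv), i.e. !(SvUTF8(sv) && !IN_BYTES). */
static bool old_cond(bool sv_utf8, bool in_bytes)
{
    return !(sv_utf8 && !in_bytes);
}

/* New test: !SvUTF8(sv) && !IN_BYTES. */
static bool new_cond(bool sv_utf8, bool in_bytes)
{
    return !sv_utf8 && !in_bytes;
}
```

The two agree whenever IN_BYTES is false. They differ only under `use bytes', where the old test is always true and the new one is always false, so with the new test the string is never recoded in that case.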

Inaba Hiroto
