
qu operator? [inaba@st.rim.or.jp: patch for 0x00-0xff always produce bytes (Was: Re: One more patch for UTF8)]

From:
Jarkko Hietaniemi
Date:
January 13, 2001 19:36
Subject:
qu operator? [inaba@st.rim.or.jp: patch for 0x00-0xff always produce bytes (Was: Re: One more patch for UTF8)]
Message ID:
20010113213646.B27962@chaos.wustl.edu
I asked Inaba to prepare a patch for the "0x00..0xff shall be bytes
for \x{}" case since he seemed to be genuinely good at UTF-8 :-)

In his latest patch he came up with a new idea that might be a nice
compromise: a new operator, qu.  Inaba's message below, his patch attached.

----- Forwarded message from Inaba Hiroto <inaba@st.rim.or.jp> -----

From: Inaba Hiroto <inaba@st.rim.or.jp>
Subject: patch for 0x00-0xff always produce bytes (Was: Re: One more patch for 
 UTF8)
Date: Sun, 14 Jan 2001 04:27:07 +0900
Message-ID: <3A60AC0A.6E8852B3@st.rim.or.jp>
To: Jarkko Hietaniemi <jhi@iki.fi>
X-Mailer: Mozilla 4.61 [en] (Win98; I)
X-Accept-Language: en

Inaba Hiroto wrote:

> Jarkko Hietaniemi wrote:
>
> > I can dig you up a pointer to the thread I mention above, but I think
> > the policy is:
> >
> >         "For the area 0x00-0xff always produce bytes."
>
> OK. I think I understand. Maybe I can make some (small) progress on that
> this weekend.

Then I made a patch for it, i.e.
   chr returns bytes for 0x00-0xff,
   a vstring is bytes if every `revision' is < 256,
   a string containing \x{0} to \x{ff} is bytes,
 and these behaviors are not affected by the utf8 pragma.

 Under the bytes pragma, only chr changes its behavior:
 it takes the lowest 8 bits of its argument as the character code.
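
The rules above can be checked with a small script. This is only a sketch against a released Perl (5.8.1 or later, where utf8::is_utf8 is available), not against the patch itself; the variable names are made up for illustration, and v-strings are left out because their flagging in released Perls may differ from what is proposed here.

```perl
use strict;
use warnings;

my $low  = chr(0xE9);     # code point inside 0x00-0xff
my $esc  = "\x{E9}";      # \x{} escape inside 0x00-0xff
my $high = chr(0x100);    # first code point outside that range

my $n = 0x141;
my $masked;
{
    use bytes;            # byte semantics for this block only
    $masked = chr($n);    # keeps the low 8 bits: 0x141 & 0xff == 0x41
}

# Report the internal UTF8 flag of each string (hypothetical labels).
for my $case ([low => $low], [esc => $esc], [high => $high], [masked => $masked]) {
    my ($name, $str) = @$case;
    printf "%-6s UTF8 flag: %s\n", $name, utf8::is_utf8($str) ? "on" : "off";
}
```

Only $high should report the flag as on; $masked should compare equal to "A".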

The patch also makes other fixes and adds or modifies some features.

1.  Your recent fix for lvalue substr() with UTF8 is incomplete.
    So I made some other changes to mg.c and pp.c and added tests
    to t/op/substr.t.

2. Now pp_stringify and sv_setsv copy the source's UTF8 flag even if IN_BYTE.
pp_stringify is called from fold_constants during the optimization phase, so
"\x{100}" was being made SvUTF8_off under use bytes.

I think the bytes pragma is for "byte semantics" and not for "do not
produce UTF8 data".

This change breaks the function to_bytes in t/lib/charnames.t,
so I also modified it.

3. New `qu' operator to generate a UTF8 string explicitly.
Though I agree with the policy "0x00-0xff always produce bytes",
I sometimes want such a string to be encoded in UTF8.
I can use pack "U0a*", but it requires more typing and has runtime overhead.

4. Fix a pp_regcomp bug exposed by the "0x00-0xff always produce bytes" change.
The bug appears if a pm has the PMdf_UTF8 flag but the interpolated string is
not UTF8_on and contains a character in 0x80-0xff.
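
To illustrate the distinction items 2 and 3 rely on (the same characters, stored either as bytes or as UTF8), here is a sketch for a released Perl. It uses utf8::upgrade as a stand-in; it is not the proposed qu operator and not the pack "U0a*" idiom mentioned above, and the variable names are hypothetical.

```perl
use strict;
use warnings;
use bytes ();                  # load bytes:: functions without enabling the pragma

my $as_bytes = "caf\x{E9}";    # every escape is below 0x100, so stored as bytes
my $as_utf8  = $as_bytes;      # independent copy
utf8::upgrade($as_utf8);       # same characters, internal representation now UTF8

printf "equal as strings:   %s\n", $as_bytes eq $as_utf8 ? "yes" : "no";
printf "character lengths:  %d and %d\n", length($as_bytes), length($as_utf8);
printf "internal byte sizes: %d and %d\n",
    bytes::length($as_bytes), bytes::length($as_utf8);
printf "UTF8 flags:         %s and %s\n",
    map { utf8::is_utf8($_) ? "on" : "off" } $as_bytes, $as_utf8;
```

The two strings compare equal and have the same character length; only the internal byte count and the flag differ, which is why a compile-time way to request the UTF8 form would be a convenience rather than a change in meaning.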
--
    Inaba Hiroto    <inaba@st.rim.or.jp>





----- End forwarded message -----

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
