I asked Inaba to prepare a patch for the "0x00..0xff shall be bytes
for \x{}" case since he seemed to be genuinely good at UTF-8 :-)
In his latest patch he came up with a new idea that might be a nice
compromise: a new operator, qu.

Inaba's message is below, his patch attached.

----- Forwarded message from Inaba Hiroto <inaba@st.rim.or.jp> -----

From: Inaba Hiroto <inaba@st.rim.or.jp>
Subject: patch for 0x00-0xff always produce bytes (Was: Re: One more patch for UTF8)
Date: Sun, 14 Jan 2001 04:27:07 +0900
Message-ID: <3A60AC0A.6E8852B3@st.rim.or.jp>
To: Jarkko Hietaniemi <jhi@iki.fi>
X-Mailer: Mozilla 4.61 [en] (Win98; I)
X-Accept-Language: en

Inaba Hiroto wrote:
> Jarkko Hietaniemi wrote:
>
> > I can dig you a pointer to the thread I mention above, but I think
> > the policy is:
> >
> > "For the area 0x00-0xff always produce bytes."
>
> OK. I think I understand. Maybe I can make some (little) step for that
> on this weekend.

Then I made a patch for it, i.e. chr returns bytes for 0x00-0xff,
a vstring is bytes if every `revision' is < 256, and a string containing
\x{0} to \x{ff} is bytes. These behaviors are not affected by the
utf8 pragma. Under the bytes pragma, only chr changes its behavior: it
takes the lowest 8 bits of its argument as the character code.

The patch also makes other fixes and adds or modifies features.

1. Your recent fix for lvalue substr() with UTF8 is incomplete.
   So I made some other changes to mg.c and pp.c and added tests
   to t/op/substr.t.

2. Now pp_stringify and sv_setsv copy the source's UTF8 flag even if
   IN_BYTE. pp_stringify is called from fold_constants at the
   optimization phase, and "\x{100}" was made SvUTF8_off under
   use bytes. I think the bytes pragma is for "byte semantics" and
   not for "do not produce UTF8 data". This change breaks the
   function to_bytes in t/lib/charnames.t, so I also modified it.

3. New `qu' operator to generate a UTF8 string explicitly.
   Though I agree with the policy "0x00-0xff always produce bytes",
   I sometimes want such a string to be encoded in UTF8.
   I can use pack "U0a*" but it requires more typing and has
   runtime overhead.

4. Fix a pp_regcomp bug exposed by the "0x00-0xff always produce
   bytes" change. The bug appears if a pm has the PMdf_UTF8 flag but
   the interpolated string is not UTF8_on and has chars 0x80-0xff.

--
Inaba Hiroto <inaba@st.rim.or.jp>

----- End forwarded message -----

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
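
For readers who want to see the "0x00-0xff always produce bytes" policy
in action, here is a small sketch. It is not taken from the attached
patch. It assumes a later perl (5.8.1 or newer) where utf8::is_utf8 and
utf8::upgrade are available, and it uses utf8::upgrade as a stand-in for
"force this string to UTF8", not the qu operator proposed above. The
helper name describe is made up for this illustration.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length only; byte semantics stay off

# Hypothetical helper: return (UTF8 flag, length in characters,
# length of the internal storage in bytes) for a string.
sub describe {
    my ($s) = @_;
    return (utf8::is_utf8($s) ? 1 : 0, length($s), bytes::length($s));
}

# Code points in 0x00-0xff: expected to stay plain bytes.
my $low_chr = chr(0xE9);
my $low_esc = "\x{e9}";
my $low_vs  = v1.22.33;      # every element < 256

# Anything above 0xff needs the UTF8 representation.
my $wide_chr = chr(0x100);
my $wide_vs  = v1.22.333;    # one element >= 256

# Explicitly asking for UTF8 storage of a 0x00-0xff string.
# (Assumption: utf8::upgrade plays the role the mail gives to qu.)
my $forced = "\x{e9}";
utf8::upgrade($forced);

for my $case (
    [ 'chr(0xE9)'         => $low_chr  ],
    [ '"\x{e9}"'          => $low_esc  ],
    [ 'v1.22.33'          => $low_vs   ],
    [ 'chr(0x100)'        => $wide_chr ],
    [ 'v1.22.333'         => $wide_vs  ],
    [ 'upgraded "\x{e9}"' => $forced   ],
) {
    my ($label, $str) = @$case;
    printf "%-18s flag=%d chars=%d bytes=%d\n", $label, describe($str);
}
```

The upgraded string still compares equal to the byte string; only the
internal storage differs, which is the distinction the mail draws
between "byte semantics" and "do not produce UTF8 data".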