more UTF8 test suites and an UTF8 patch
From:
Inaba Hiroto
Date:
December 29, 2000 21:17
Subject:
more UTF8 test suites and an UTF8 patch
Message ID:
3A4D722D.243AFD88@st.rim.or.jp
Attached are a UTF8 test suite and a UTF8 patch for perl@8223.
The files in the test suite are:
t/op/subst_utf8.t,
t/op/substr_utf8.t,
t/op/regexp_utf8.t + t/op/re_tests.utf8
They are converted from t/op/{subst.t,substr.t,regexp.t,re_tests}
by simply translating ASCII characters to Unicode characters. (In fact,
they are "FULLWIDTH" characters, code points FF01-FF5E.)
The files are UTF8-encoded, so you need a UTF8-capable editor/terminal
to view them.
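The ASCII-to-FULLWIDTH conversion can be sketched roughly as below. This is
only an illustration, not the script used to build the test files: the
function name ascii_to_fullwidth is hypothetical, the input is assumed to be
pure ASCII, and leaving whitespace and control characters untouched is an
assumption. The offset 0xFEE0 follows from the range given above
(0x21 + 0xFEE0 = 0xFF01, 0x7E + 0xFEE0 = 0xFF5E).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the patch: copy src to dst, replacing
 * each printable ASCII character 0x21-0x7E with its FULLWIDTH counterpart
 * U+FF01-U+FF5E (code point + 0xFEE0), encoded as three UTF-8 bytes.
 * Other bytes (space, newline, ...) are copied unchanged; that choice is
 * an assumption. dst must have room for 3*strlen(src)+1 bytes.
 * Returns the number of bytes written, not counting the trailing NUL. */
size_t ascii_to_fullwidth(const char *src, char *dst)
{
    unsigned char *d = (unsigned char *)dst;

    assert(src != NULL && dst != NULL);
    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;
        if (c >= 0x21 && c <= 0x7E) {
            unsigned int cp = c + 0xFEE0u;              /* U+FF01..U+FF5E */
            *d++ = (unsigned char)(0xE0 | (cp >> 12));  /* 1110xxxx */
            *d++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            *d++ = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            *d++ = c;
        }
    }
    *d = '\0';
    return (size_t)(d - (unsigned char *)dst);
}
```

Each converted character grows from one byte to three, which is why byte
offsets and character offsets no longer agree in the converted tests.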
perl@8223 fails some of these tests. The patch fixes them (I think).
The patch also makes some feature changes, namely adding a new pragma.
The following is a description of the patch (it does not mention all fixes).
doop.c:
HALF_UTF8_UPGRADE is deleted.
Changed
do_trans_simple(), do_trans_count(),
do_trans_complex(), do_trans_simple_utf8(),
do_trans_count_utf8(), do_trans_complex_utf8().
In do_trans_simple_utf8() and do_trans_complex_utf8(),
allocate the destination area with
New(0, d, len*3+UTF8_MAXLEN, U8)
and call Renew if the length is not enough.
(The initial factor of 3 is arguable.)
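The allocate-then-Renew strategy can be sketched in plain C as follows.
This is not the code in doop.c: malloc/realloc stand in for New/Renew, the
outbuf type and its functions are hypothetical, the doubling growth policy is
an assumption, and SKETCH_UTF8_MAXLEN is a placeholder value rather than
perl's real UTF8_MAXLEN.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_UTF8_MAXLEN 13   /* placeholder, not taken from perl's headers */

/* Hypothetical growable destination buffer. */
typedef struct {
    unsigned char *buf;   /* start of the destination area */
    size_t used;          /* bytes written so far */
    size_t cap;           /* bytes allocated */
} outbuf;

/* Initial estimate from the post: len*3 + UTF8_MAXLEN bytes. */
int outbuf_init(outbuf *o, size_t len)
{
    assert(o != NULL);
    o->cap = len * 3 + SKETCH_UTF8_MAXLEN;
    o->used = 0;
    o->buf = malloc(o->cap);
    return o->buf != NULL;
}

/* Ensure `need` more bytes fit; grow the buffer (the Renew step) if not. */
int outbuf_reserve(outbuf *o, size_t need)
{
    size_t ncap;
    unsigned char *nbuf;

    if (o->cap - o->used >= need)
        return 1;
    ncap = o->cap * 2 + need;        /* growth policy is an assumption */
    nbuf = realloc(o->buf, ncap);
    if (nbuf == NULL)
        return 0;
    o->buf = nbuf;
    o->cap = ncap;
    return 1;
}

/* Append n bytes (for example one encoded character) to the buffer. */
int outbuf_put(outbuf *o, const unsigned char *bytes, size_t n)
{
    size_t i;

    if (!outbuf_reserve(o, n))
        return 0;
    for (i = 0; i < n; i++)
        o->buf[o->used++] = bytes[i];
    return 1;
}
```

A larger initial factor means fewer reallocations but more wasted memory for
strings that do not grow, which is the trade-off behind the "arguable" remark.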
mg.c:
fix magic_regdatum_get() for @+ and @-.
fix magic_setsubstr().
op.c:
peep() fixed in case OP_HELEM: and case OP_HSLICE: for UTF8 hash
keys.
pp.c:
For pp_split(), even if rx->minlen == 1, optimization should
not apply with utf8 regex.
pp_ctl.c:
In pp_regcomp(), use PMdf_DYN_UTF8 flag to set pm->op_pmdynflags
instead of PMdf_UTF8 flag.
In pp_substcont(), SvUTF8_on(targ) when DO_UTF8(dstr).
In die_where(), SvUTF8_on(ERRSV) if `use utf8'.
In sv_compile_2op(), keep HINT_UTF8 flag in PL_hints.
pp_hot.c:
In pp_concat(), call SvPV before DO_UTF8.
In pp_match(), do not use CALLREG_INTUIT_START if TARG utf8-ness is
not equal to regexp utf8-ness.
regcomp.c:
Add `utf8' member to struct RExC_state_t.
Introduce the ANYOF_UNICODE_ALL flag and change cl_anything(),
cl_and(), cl_or(), regclass(), regprop() to deal with the flag.
In pregcomp(), use both PMdf_DYN_UTF8 flag and PMdf_UTF8 flag as
UTF8-regexp flag. Use EXACTF and EXACTFL node as startclass even if
the regexp is UTF8.
In regclass(), don't optimize inverted regclass for ANYOF_{charclass}.
regexec.c:
Add reghop3(), reghopmaybe3() functions and related macros to
explicitly specify limit.
Several changes from simple pointer calculation to HOPx/CHR_DIST.
(But I know it is not complete;-<)
In find_byclass(), case EXACTF fix.
In regmatch(), case REG_ANY, EXACT(|F|FL) and PLUS fix.
In regrepeat(), check hardcount < max.
In reginclass(), handle the ANYOF_UNICODE_ALL flag.
(I think REGEXP matching has become somewhat slower, but I have not measured it.)
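The idea of hopping by characters with an explicit limit, as reghop3() is
described above, can be sketched like this. It is not the regexec.c
implementation: hop_limited and is_cont are hypothetical names, and the
sketch assumes well-formed UTF-8 with the limit lying on a character
boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Is this a UTF-8 continuation byte (10xxxxxx)? */
static int is_cont(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Hypothetical stand-in for the idea behind reghop3(): move `off`
 * characters from s (forward if positive, backward if negative) but never
 * step past lim. For forward hops lim is the end of the buffer; for
 * backward hops it is the start. */
const unsigned char *hop_limited(const unsigned char *s, int off,
                                 const unsigned char *lim)
{
    assert(s != NULL && lim != NULL);
    if (off >= 0) {
        while (off-- > 0 && s < lim) {
            s++;                                /* skip the lead byte */
            while (s < lim && is_cont(*s))      /* skip continuation bytes */
                s++;
        }
    } else {
        while (off++ < 0 && s > lim) {
            s--;                                /* step back one byte */
            while (s > lim && is_cont(*s))      /* back up to the lead byte */
                s--;
        }
    }
    return s;
}
```

Plain pointer arithmetic such as s + 2 would land in the middle of a
multi-byte character; hopping character by character with a limit avoids
both that and running off either end of the string.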
sv.c:
In sv_setpv(|n)(), sv_usepvn(), sv_catpv(|n)(), use SvPOK_only_UTF8
instead of SvPOK_only to `validate pointer'.
In sv_eq() and sv_cmp(), check the PL_hints & HINT_UTF8_DISTINCT flag
for the distinct pragma (see below).
toke.c:
In scan_const(), change `\x{...}' parsing logic.
Under `use utf8' and `no bytes':
o The <DATA> filehandle is automatically in ":utf8" mode.
o Here documents are SvUTF8_on.
o Some barewords are SvUTF8_on.
(currently `bareword =>...' and `$h{bareword}')
o qw(...) elements are SvUTF8_on.
In scan_trans(), OP_TRANS's op_private will have
OPpTRANS_FROM_UTF flag if DO_UTF8(PL_lex_stuff)
OPpTRANS_TO_UTF flag if DO_UTF8(PL_lex_repl)
vstring will be downgraded iff `no utf8' or `use bytes'.
utf8.c:
In is_utf8_string(), allow len == 0, in which case strlen() is called.
Lastly, the new pragma I would like to propose in the patch is
lib/distinct.pm:
`distinct' is a pragma to strictly distinguish UTF8 data from
non-UTF8 data.
Currently, a string which is SvUTF8_off can be equal to another string
which is SvUTF8_on; `eq' can't distinguish them. This pragma forces every
SvUTF8_on string to differ from any SvUTF8_off string.
With the patch, this pragma affects `eq' and `cmp' only.
Should it also affect index, rindex, regexp match, etc.?
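The effect on `eq' can be sketched in C as below. This is not the sv_eq()
code from the patch: sketch_str and sketch_eq are hypothetical, the
`distinct' argument stands in for the PL_hints & HINT_UTF8_DISTINCT check,
and comparing raw bytes when the pragma is off is a simplification.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical string value: bytes plus a UTF8 flag. */
typedef struct {
    const char *pv;   /* string bytes */
    size_t len;       /* length in bytes */
    int utf8;         /* nonzero if the string is flagged as UTF8 */
} sketch_str;

/* Equality test. With `distinct' set, two strings whose UTF8 flags differ
 * are never equal; otherwise only the bytes are compared (a simplification
 * of what perl really does). */
int sketch_eq(const sketch_str *a, const sketch_str *b, int distinct)
{
    assert(a != NULL && b != NULL);
    if (distinct && (!a->utf8 != !b->utf8))
        return 0;
    return a->len == b->len && memcmp(a->pv, b->pv, a->len) == 0;
}
```

With x = { "abc", 3, 0 } and y = { "abc", 3, 1 }, sketch_eq(&x, &y, 0) is
true while sketch_eq(&x, &y, 1) is false.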
--
Inaba Hiroto <inaba@st.rim.or.jp>