more UTF8 test suites and an UTF8 patch

From: Inaba Hiroto
Date: December 29, 2000 21:17
Subject: more UTF8 test suites and an UTF8 patch
Message ID: 3A4D722D.243AFD88@st.rim.or.jp

Attached are a UTF8 test suite and a UTF8 patch for perl@8223.

The files in the test suite are:
  t/op/subst_utf8.t,
  t/op/substr_utf8.t,
  t/op/regexp_utf8.t + t/op/re_tests.utf8

They are converted from t/op/{subst.t,substr.t,regexp.t,re_tests}
by simply translating ASCII characters to Unicode characters.  (In fact,
to the "FULLWIDTH" characters, code points FF01-FF5E.)
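
As an illustration of that translation, here is a minimal C sketch. It
is not the script actually used to build the test files; the function
names are hypothetical, and it assumes plain ASCII input and leaves
bytes outside 0x21-0x7E (such as the space) unchanged.

```c
#include <assert.h>
#include <stddef.h>

/* Map a printable ASCII byte (0x21..0x7E) to the FULLWIDTH code point
 * at the same offset (U+FF01..U+FF5E). Other bytes are returned as-is. */
unsigned long to_fullwidth(unsigned char c)
{
    if (c >= 0x21 && c <= 0x7E)
        return 0xFF01UL + (unsigned long)(c - 0x21);
    return c;
}

/* Encode a code point below U+10000 as UTF-8 into out (up to 3 bytes).
 * Returns the number of bytes written. */
size_t encode_utf8_bmp(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Convert a NUL-terminated ASCII string to UTF-8 encoded FULLWIDTH text.
 * dst must have room for 3 * strlen(src) + 1 bytes.
 * Returns the number of bytes written, not counting the final NUL. */
size_t ascii_to_fullwidth(const char *src, unsigned char *dst)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    for (; *src; src++)
        n += encode_utf8_bmp(to_fullwidth((unsigned char)*src), dst + n);
    dst[n] = '\0';
    return n;
}
```

Each translated character grows from one byte to three, which is why
the converted test files need a UTF8-capable editor.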

The files are UTF8-encoded, so you need a UTF8-capable editor/terminal
to view them.

perl@8223 fails some of these tests. The patch fixes them (I think).
The patch also makes some feature changes, namely adding a new pragma.

The following is a description of the patch. (It does not mention all
fixes.)

doop.c:
  HALF_UTF8_UPGRADE is deleted.

  Changed
    do_trans_simple(), do_trans_count(),
    do_trans_complex(), do_trans_simple_utf8(),
    do_trans_count_utf8(), do_trans_complex_utf8().

  In do_trans_simple_utf8() and do_trans_complex_utf8(),
    allocate the destination area with
        New(0, d, len*3+UTF8_MAXLEN, U8)
    and call Renew if that length is not enough.
    (The initial factor of 3 is arguable.)
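
The allocate-then-grow strategy can be sketched in standalone C as
follows. This is not the doop.c code: malloc/realloc stand in for
New/Renew, SKETCH_UTF8_MAXLEN is a placeholder bound, the doubling
policy is an assumption, and translate_grow() and map_repeat4() are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder upper bound on the bytes one mapped character may need. */
#define SKETCH_UTF8_MAXLEN 13

/* Hypothetical per-character mapper: writes the replacement for byte c
 * into out (at most SKETCH_UTF8_MAXLEN bytes) and returns its length. */
typedef size_t (*map_fn)(unsigned char c, unsigned char *out);

/* Demonstration mapper: expands every byte to four copies of itself,
 * which exceeds the initial factor of 3 and so forces a regrow. */
size_t map_repeat4(unsigned char c, unsigned char *out)
{
    memset(out, c, 4);
    return 4;
}

/* Translate src[0..len) into a newly allocated buffer. Returns NULL if
 * allocation fails; otherwise stores the result length in *out_len.
 * The caller frees the result. Overflow of len * 3 is not checked. */
unsigned char *translate_grow(const unsigned char *src, size_t len,
                              map_fn map, size_t *out_len)
{
    size_t cap = len * 3 + SKETCH_UTF8_MAXLEN;  /* initial guess */
    size_t used = 0;
    unsigned char *dst;

    assert(map != NULL && out_len != NULL);
    dst = malloc(cap);
    if (dst == NULL)
        return NULL;

    for (size_t i = 0; i < len; i++) {
        /* Keep room for one more worst-case character. */
        if (cap - used < SKETCH_UTF8_MAXLEN) {
            size_t new_cap = cap * 2;
            unsigned char *tmp = realloc(dst, new_cap);
            if (tmp == NULL) {
                free(dst);
                return NULL;
            }
            dst = tmp;
            cap = new_cap;
        }
        used += map(src[i], dst + used);
    }
    *out_len = used;
    return dst;
}
```

With a mapper that never emits more than three bytes per input byte the
regrow branch is never taken, so the initial factor only decides how
often the slow path runs, not whether the result is correct.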

mg.c:
  fix magic_regdatum_get() for @+ and @-.

  fix magic_setsubstr().

op.c:
  peep() fixed in case OP_HELEM: and case OP_HSLICE: for UTF8 hash
  keys.

pp.c:
  In pp_split(), even if rx->minlen == 1, the optimization should
  not be applied to a utf8 regex.

pp_ctl.c:
  In pp_regcomp(), use PMdf_DYN_UTF8 flag to set pm->op_pmdynflags
  instead of PMdf_UTF8 flag.

  In pp_substcont(), SvUTF8_on(targ) when DO_UTF8(dstr).

  In die_where(), SvUTF8_on(ERRSV) if `use utf8'.

  In sv_compile_2op(), keep HINT_UTF8 flag in PL_hints.

pp_hot.c:
  In pp_concat(), call SvPV before DO_UTF8.

  In pp_match(), do not use CALLREG_INTUIT_START if TARG utf8-ness is
  not equal to regexp utf8-ness.

regcomp.c:
  Add `utf8' member to struct RExC_state_t.

  Introduce ANYOF_UNICODE_ALL flag and change cl_anything(),
  cl_and(), cl_or(), regclass(), regprop() to deal with the flag.

  In pregcomp(), use both PMdf_DYN_UTF8 flag and PMdf_UTF8 flag as
  UTF8-regexp flag. Use EXACTF and EXACTFL node as startclass even if
  the regexp is UTF8.

  In regclass(), don't optimize inverted regclass for ANYOF_{charclass}.

regexec.c:
  Add reghop3(), reghopmaybe3() functions and related macros to
  explicitly specify limit.
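
The idea of hopping by characters with an explicit limit can be
sketched in standalone C. This is an illustration, not the patch's
reghop3(): hop3_sketch() and seq_len() are hypothetical names, and the
sketch assumes standard UTF-8 sequences of at most four bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the UTF-8 sequence introduced by lead byte b. A stray
 * continuation or invalid byte counts as 1 so that a hop always
 * makes progress. */
static size_t seq_len(unsigned char b)
{
    if (b < 0x80)
        return 1;
    if ((b & 0xE0) == 0xC0)
        return 2;
    if ((b & 0xF0) == 0xE0)
        return 3;
    if ((b & 0xF8) == 0xF0)
        return 4;
    return 1;
}

/* Move s by off characters: forward if off > 0, backward if off < 0.
 * lim is the end of the buffer for a forward hop and the start of the
 * buffer for a backward hop. The result never goes past lim. */
const unsigned char *hop3_sketch(const unsigned char *s, long off,
                                 const unsigned char *lim)
{
    assert(s != NULL && lim != NULL);
    if (off >= 0) {
        while (off-- > 0 && s < lim) {
            size_t n = seq_len(*s);
            /* Clamp a truncated trailing sequence to the limit. */
            s = ((size_t)(lim - s) < n) ? lim : s + n;
        }
    } else {
        while (off++ < 0 && s > lim) {
            s--;
            /* Skip back over continuation bytes to the lead byte. */
            while (s > lim && (*s & 0xC0) == 0x80)
                s--;
        }
    }
    return s;
}
```

Passing the limit explicitly lets the same routine serve callers that
are bounded by different ends of the string, rather than relying on a
single global bound.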

  Several changes from simple pointer calculation to HOPx/CHR_DIST.
  (But I know it is not complete;-<)

  In find_byclass(), case EXACTF fix.

  In regmatch(), case REG_ANY, EXACT(|F|FL) and PLUS fix.

  In regrepeat(), check hardcount < max.

  In reginclass(), deal with the ANYOF_UNICODE_ALL flag.

(I think regexp matching has become somewhat slower, but I have not
measured it.)

sv.c:
  In sv_setpv(|n)(), sv_usepvn(), sv_catpv(|n)(), use SvPOK_only_UTF8
  instead of SvPOK_only to `validate pointer'.

  In sv_eq and sv_cmp(), check PL_hints & HINT_UTF8_DISTINCT flag
  for distinct pragma (see below).

toke.c:
  In scan_const(), change `\x{...}' parsing logic.

  Under `use utf8' and `no bytes':
      o  The <DATA> filehandle is automatically in ":utf8" mode.
      o  Here documents are SvUTF8_on.
      o  Some barewords are SvUTF8_on
         (currently `bareword =>...' and `$h{bareword}').
      o  qw(...) elements are SvUTF8_on.

  In scan_trans(), OP_TRANS's op_private will have the
      OPpTRANS_FROM_UTF flag if DO_UTF8(PL_lex_stuff)
      OPpTRANS_TO_UTF   flag if DO_UTF8(PL_lex_repl)

  vstring will be downgraded iff `no utf8' or `use bytes'.

utf8.c:
  In is_utf8_string(), allow len == 0 and call strlen() in that case.
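
A standalone C sketch of that calling convention is given below. It is
not Perl's is_utf8_string(): the name is hypothetical, and the check is
simplified to the lead/continuation byte structure of sequences up to
four bytes (no overlong or surrogate checks). It also assumes, as the
note above suggests, that len == 0 means "measure with strlen()".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if s[0..len) has well-formed UTF-8 byte structure, else 0.
 * If len is 0, s is treated as NUL-terminated and strlen() supplies
 * the length. */
int is_utf8_string_sketch(const unsigned char *s, size_t len)
{
    const unsigned char *end;

    assert(s != NULL);
    if (len == 0)
        len = strlen((const char *)s);
    end = s + len;

    while (s < end) {
        size_t n;

        if (*s < 0x80)
            n = 1;
        else if ((*s & 0xE0) == 0xC0)
            n = 2;
        else if ((*s & 0xF0) == 0xE0)
            n = 3;
        else if ((*s & 0xF8) == 0xF0)
            n = 4;
        else
            return 0;               /* stray continuation or bad lead */

        if ((size_t)(end - s) < n)
            return 0;               /* sequence runs off the end */
        for (size_t i = 1; i < n; i++)
            if ((s[i] & 0xC0) != 0x80)
                return 0;           /* expected a continuation byte */
        s += n;
    }
    return 1;
}
```

One consequence of this convention is that a caller cannot ask about a
genuinely empty buffer that is not NUL-terminated.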

Lastly, the new pragma I would like to propose in the patch is,

lib/distinct.pm:
  `distinct' is a pragma to strictly distinguish UTF8 data from
  non-UTF8 data.

  Currently, a string that is SvUTF8_off can compare equal to another
  string that is SvUTF8_on; `eq' cannot distinguish them.  This pragma
  forces every SvUTF8_on string to differ from every SvUTF8_off string.
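
The intended `eq' behaviour can be sketched in standalone C. This is
not the sv_eq() change from the patch: struct sketch_str is a
hypothetical stand-in for a scalar, the distinct argument stands in for
the PL_hints & HINT_UTF8_DISTINCT test, and the sketch assumes that
without the pragma only the bytes are compared.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a Perl scalar: bytes plus a UTF8 flag. */
struct sketch_str {
    const char *bytes;
    size_t len;
    int is_utf8;
};

/* Equality test. When distinct is nonzero, two strings whose UTF8
 * flags differ never compare equal. Otherwise (an assumption of this
 * sketch) only the bytes are compared. */
int sketch_eq(const struct sketch_str *a, const struct sketch_str *b,
              int distinct)
{
    assert(a != NULL && b != NULL);
    if (distinct && (a->is_utf8 != 0) != (b->is_utf8 != 0))
        return 0;
    return a->len == b->len && memcmp(a->bytes, b->bytes, a->len) == 0;
}
```

How `cmp' should order two strings that differ only in the flag is not
stated in this message, so the sketch covers equality only.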

With the patch, this pragma affects `eq' and `cmp' only.
Should it also affect index, rindex, regexp match, etc.?
--
 Inaba Hiroto <inaba@st.rim.or.jp>
