
Re: more UTF8 test suites and an UTF8 patch

Jarkko Hietaniemi
December 29, 2000 22:15

The patch looks really impressive based on the description.

Of course, I wouldn't be earning my release manager title if I
didn't find things to complain about.  I haven't yet tried to apply
the patch or even looked at it, so I'll base these comments on the
description.

On Sat, Dec 30, 2000 at 02:27:10PM +0900, Inaba Hiroto wrote:
> Attached are UTF8 test suite and an UTF8 patch for perl@8223.
> The files in test suite are:
>   t/op/subst_utf8.t,
>   t/op/substr_utf8.t,
>   t/op/regexp_utf8.t + t/op/re_tests.utf8
> They are converted from t/op/{subst.t,substr.t,regexp.t,re_tests}
> simply translating ascii characters to unicode characters.  (In fact,
> they are "FULLWIDTH" characters code FF01-FF5E)

I don't like the idea of always having to remember, in future, to patch
the _utf8.t versions if someone patches the non-utf8 versions (or vice
versa).

Would it be somehow possible to automate the task so that there would
be some sort of template files from which both the byte and 'wide
character' version would be automagically produced?  (The template
could of course be the byte version, to save space)
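
As a rough illustration of that kind of generation step, here is a minimal
sketch. It assumes the only transformation needed is the one described in
the quoted text: shifting printable ASCII (0x21-0x7E) into the FULLWIDTH
range FF01-FF5E, which is a constant offset of 0xFEE0. The helper name
to_fullwidth is hypothetical and is not part of the patch. A real generator
would also have to leave regex metacharacters, escapes and the test
scaffolding itself untouched.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the patch): map each printable ASCII
# character (0x21-0x7E) to its FULLWIDTH counterpart (U+FF01-U+FF5E)
# by adding the constant offset 0xFEE0. Everything else is left alone.
sub to_fullwidth {
    my ($text) = @_;
    $text =~ s/([\x21-\x7e])/chr(ord($1) + 0xFEE0)/ge;
    return $text;
}

# Show the code points produced for a short byte-version string.
my $wide = to_fullwidth("abc");
printf "U+%04X\n", ord($_) for split //, $wide;
```

With a step like this, only the byte version would need to be maintained
by hand.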

> The files are UTF8-encoded so you need an UTF8 capable editor/terminal
> to see it.

Urrrgh.  Please, no.  I've only just managed to undo this bad idea
in t/op/utf8decode.t.  Having raw binary bytes (raw UTF-8, for example)
in test files is not a good idea.  Editors have problems, patch/diff
have problems, mailers have problems.  Let's not go there again.  Use
\x{HHH} in Perl.
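
A minimal sketch of what the \x{HHH} form looks like, using three of the
FULLWIDTH characters mentioned in the quoted text. The variable name is
illustrative only. The point is that the source file stays plain 7-bit
ASCII while the string still holds wide characters.

```perl
use strict;
use warnings;

# FULLWIDTH LATIN SMALL LETTER A, B and C written as escapes, so no raw
# UTF-8 bytes appear in the file itself.
my $wide = "\x{FF41}\x{FF42}\x{FF43}";

# The string is three characters long, whatever its internal encoding.
print length($wide), "\n";
printf "U+%04X\n", ord($_) for split //, $wide;
```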

> [SNIP]

> pp_ctl.c:
>   In pp_regcomp(), use PMdf_DYN_UTF8 flag to set pm->op_pmdynflags
>   instead of PMdf_UTF8 flag.

If you have formed some sort of clear idea of the various UTF8 flags
(what each one is doing), please feel free to document them somewhere.

> [SNIP]

> toke.c:
>   In scan_const(), change `\x{...}' parsing logic.

While you are at it, could you change [\x{80}-\x{ff}] so that it
produces (in string constants) or matches (in regexes) bytes, not
UTF-8 characters?  This way it would be internally consistent with
chr() and vstrings.

> [SNIP]

> Lastly, the new pragma I would like to propose in the patch is,
> lib/
>   `distinct' is a pragma to strictly distinguish UTF8 data and
>   non-UTF data.
>   Now any string which is SvUTF_off can be equal to another string
>   which is SvUTF_on. `eq' can't distinguish them.  This pragma forces
>   every SvUTF_on string to differ from any SvUTF_off string.

Ummmm.  Introducing new pragmas should be considered carefully.
Can you give an example of how to use this pragma?  What problem
does it solve?
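
For reference, the situation the proposal describes can be reproduced with
a sketch like the following. It relies on utf8::upgrade and utf8::is_utf8,
which come from later Perl releases and are an assumption here, not
something available in perl@8223. It does not use the proposed pragma,
it only shows two strings that `eq' treats as equal although their
internal UTF8 flags differ.

```perl
use strict;
use warnings;

my $bytes = "\xe9";      # one character, stored internally as one byte
my $wide  = $bytes;
utf8::upgrade($wide);    # same character, now stored internally as UTF-8

# The internal flag differs between the two copies...
print "bytes flag: ", (utf8::is_utf8($bytes) ? "on" : "off"), "\n";
print "wide flag:  ", (utf8::is_utf8($wide)  ? "on" : "off"), "\n";

# ...but eq compares characters, so the strings are reported as equal.
print "eq says:    ", ($bytes eq $wide ? "equal" : "different"), "\n";
```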

> With the patch, this pragma affects `eq' and `cmp' only.
> Should it also affect index, rindex, regexp match, etc.?
> --
>  Inaba Hiroto <>

$jhi++; #
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
