I'm sorry for the late reply.

Jarkko Hietaniemi wrote:
> > They are converted from t/op/{subst.t,substr.t,regexp.t,re_tests}
> > simply translating ascii characters to unicode characters. (In fact,
> > they are "FULLWIDTH" characters code FF01-FF5E)
>
> I don't like the idea in future of having to always remember to patch
> the _utf8.t versions if someone patches the non-utf8 versions (or vice
> versa).

Yes, I agree.

> Would it be somehow possible to automate the task so that there would
> be some sort of template files from which both the byte and 'wide
> character' version would be automagically produced? (The template
> could of course be the byte version, to save space)

I'm now working on such template files for
{subst.t,substr.t,regexp.t,re_tests} and their UTF-8 versions.

> > pp_ctl.c:
> > In pp_regcomp(), use PMdf_DYN_UTF8 flag to set pm->op_pmdynflags
> > instead of PMdf_UTF8 flag.
>
> If you have formed some sort of clear idea of the various UTF8 flags
> (what each one is doing), please feel free to document them somewhere.

I suppose the PMdf_UTF8 flag means that the regexp contains UTF8 data at
script compile time, and the PMdf_DYN_UTF8 flag (which I introduced)
means that a dynamically interpolated string is UTF8.

> > toke.c:
> > In scan_const(), change `\x{...}' parsing logic.
>
> While you are at it, could you change [\x{80}-\x{ff}] to produce/match
> (string constants / regexes) bytes, not UTF-8 characters? This way
> it would be internally consistent with chr() and vstrings.

I think we can, though t/op/length.t test 7 assumes the current behavior.

> > Lastly, the new pragma I would like to propose in the patch is,
> >
> > lib/distinct.pm:
> > `distinct' is a pragma to strictly distinguish UTF8 data and
> > non-UTF data.
>
> Ummmm. Introducing new pragmas should be considered carefully.

Yes.

> Can you give an example of how to use this pragma? What problem
> does it solve?

Actually, I have no real problem to solve with this pragma.
I'll send a separate mail for this topic.
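
For reference, the ASCII-to-FULLWIDTH translation mentioned above for the _utf8.t tests can be sketched in a few lines of Perl. This is only an illustration under stated assumptions, not the script actually used: the name to_fullwidth is hypothetical, and it assumes that only the printable ASCII characters U+0021-U+007E are mapped (to U+FF01-U+FF5E, a fixed offset of 0xFEE0) while everything else, including spaces and newlines, is left alone.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the patch: map printable ASCII
# (U+0021..U+007E) onto the FULLWIDTH forms (U+FF01..U+FF5E).
# Both ranges hold 94 characters, so tr/// pairs them one to one.
sub to_fullwidth {
    my ($text) = @_;
    (my $wide = $text) =~ tr/\x{21}-\x{7E}/\x{FF01}-\x{FF5E}/;
    return $wide;
}

# Show the code points of a converted sample string.
# The space (U+0020) lies outside the mapped range and is unchanged.
my $wide = to_fullwidth("abc 123");
print join(" ", map { sprintf "U+%04X", ord } split //, $wide), "\n";
```

A real template-driven generator would also have to decide which parts of a test file must stay in ASCII (Perl keywords, operators, regexp metacharacters), which this sketch does not attempt.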
-- 
Inaba Hiroto <inaba@st.rim.or.jp>