
Re: GSOC Status Report, Week 5

From: Father Chrysostomos
Date: July 3, 2011 13:40
Subject: Re: GSOC Status Report, Week 5
Message ID: C482034E-5501-41FD-802C-36F5D8C30BBA@cpan.org

On Jun 29, 2011, at 10:36 PM, Brian Fraser wrote:

> Howdy all.
> 
> Apologies for the late report. I screwed up and git cleaned/reset away a bunch of mro tests and some changes, so I wanted to make up for the lost work before writing this.
> 
> mro is pretty much finished. A part of the tokenizer that isn't clean is causing some tests to die under strict; I have patched that locally to get the desired behavior, but it'll probably be saner to turn off strict in the final version and leave the toke.c changes for whenever I actually tackle that.
> 
> ->can, ->isa, call_method() and gv_fetchmethod_(autoload|flags) are now both UTF-8 and null clean; I forgot about ->DOES, so I'll be doing that soon, but I don't expect it to be problematic. There are a couple of error messages regarding versions that are still undone, as are error messages for scalar filehandles (I can't quite recall the details now, but last I checked I suspected they might fix themselves post-merge with the pad stuff) and things dealing with SvUTF8() on globs.
> 
> Beyond that, I think there's only one thing pending implementation for stashes and GVs, which is somewhat related to the SvUTF8() issue: When to downgrade?
> Specifically: Right now, when initializing a stash/GV, [hg]v_name_set calls share_hek(), which does the downgrading as necessary. And that appears to be sufficient, as the rest of the gv/hv code is robust enough to handle whatever you throw at it. Should I rely on that? The pad had to jump through a few hoops because of this issue, but it doesn't deal in heks.
> 
> But more importantly, it brings up another issue: When should I pass in the flag (from toke.c)? Simply doing UTF ? SVf_UTF8 : 0 is obviously wrong (you'll end up with a UTF-8 flagged _, for instance).

I’m not so sure that that *is* wrong. Having a UTF8-flagged _ is harmless. If toke.c has to check for non-ASCII characters to determine when to pass the flag, how is that different from having share_hek do the check, from an efficiency standpoint?

> I have a branch with a macro (unimaginatively named UTF_T) in toke.c that replaces those UTF with something like UTF_T(s, len, UTF), where UTF_T is defined as
> #define UTF_T(s,len,u) (u && !is_ascii_string((const U8*)s, len) && is_utf8_string((const U8*)s, len))
> Something like that - name notwithstanding - appears to make do just fine, but I've been wrong before.

See above. I think you’re spreading too much complexity around, but I could be wrong.
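To make the comparison concrete, here is a minimal standalone sketch of what the UTF_T check amounts to, written outside of perl's headers. The names sketch_is_ascii, sketch_is_utf8 and sketch_utf_t are hypothetical stand-ins, not the core's is_ascii_string and is_utf8_string, and the UTF-8 check is simplified (it is stricter than perl's about code points above U+10FFFF and does not reject surrogates).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for is_ascii_string(): true if every byte is < 0x80. */
static int sketch_is_ascii(const unsigned char *s, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        if (s[i] >= 0x80)
            return 0;
    return 1;
}

/* Hypothetical stand-in for is_utf8_string(): a simplified well-formedness
 * check for 1 to 4 byte sequences. Not perl's implementation. */
static int sketch_is_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0, k, need;
    while (i < len) {
        unsigned char c = s[i];
        if (c < 0x80)                    need = 0;
        else if (c >= 0xC2 && c <= 0xDF) need = 1;
        else if (c >= 0xE0 && c <= 0xEF) need = 2;
        else if (c >= 0xF0 && c <= 0xF4) need = 3;
        else return 0;                   /* stray continuation or invalid lead */
        if (need > len - i - 1)
            return 0;                    /* sequence runs past the end */
        if (c == 0xE0 && s[i + 1] < 0xA0) return 0;  /* overlong 3-byte form */
        if (c == 0xF0 && s[i + 1] < 0x90) return 0;  /* overlong 4-byte form */
        if (c == 0xF4 && s[i + 1] > 0x8F) return 0;  /* above U+10FFFF */
        for (k = 1; k <= need; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;
        i += need + 1;
    }
    return 1;
}

/* Same shape as the quoted UTF_T(s,len,u): pass the flag only when the
 * source is in UTF-8 mode AND the name has non-ASCII, well-formed UTF-8. */
static int sketch_utf_t(const char *s, size_t len, int u)
{
    assert(s != NULL || len == 0);
    return u
        && !sketch_is_ascii((const unsigned char *)s, len)
        && sketch_is_utf8((const unsigned char *)s, len);
}
```

With this, sketch_utf_t("_", 1, 1) is 0, so a plain _ never gets the flag. Note that the check scans the name up to twice, which is the same kind of work the downgrade in share_hek would otherwise do once.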

> 
> On a less than cheerful note, I've fallen behind schedule with tests, particularly regarding the new versions of several functions. I'll try to get back on track asap.
> 
> Also, I seem to have introduced a bug somewhere along the road that's breaking the test suite depending on how it's run, which is stopping me from celebrating anything. So, while I fix that, here are two semi-contentious changes that need addressing:
> 
> First, S_not_a_number calls sv_uni_display with a 0 for its flags. This means that now, if you try to do something like 1 + *Lèon, you'll get a fairly useless warning ('Argument "\x{2a}\x{6d}..." isn't numeric', etc, where \x{6d} is the 'm' of *main::Lèon).
> Would there be any objections to passing in UNI_DISPLAY_ISPRINT instead? It would get the right behavior for globs ('Argument "*main::L\x{e8}..." isn't numeric'), but it _will_ break tests outside of the core. Uh, probably.

That’s been brought up before. Several people (well, at least two) consider the current behaviour a bug, plain and simple, and no one has argued the other way. What you are suggesting is good, IMO.
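For anyone following along, the difference between the two flag values can be modelled in a few lines of standalone C. This is only a sketch under stated assumptions: sketch_uni_display and SKETCH_DISPLAY_ISPRINT are hypothetical names, the input is an array of code points rather than an SV, "printable" is taken to mean ASCII 0x20 to 0x7E, and the truncation to "..." seen in the real warning is not modelled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_DISPLAY_ISPRINT 0x1u  /* hypothetical flag, for illustration only */

/* Render n code points into out (capacity outsz, always NUL-terminated).
 * Without the flag every character becomes \x{hex}; with it, printable
 * ASCII is copied through and only the rest is escaped.
 * Returns the number of bytes written, not counting the NUL. */
static size_t sketch_uni_display(char *out, size_t outsz,
                                 const unsigned long *cps, size_t n,
                                 unsigned flags)
{
    size_t used = 0, i;

    assert(out != NULL);
    if (outsz == 0)
        return 0;
    out[0] = '\0';

    for (i = 0; i < n; i++) {
        char piece[32];
        int w;

        if ((flags & SKETCH_DISPLAY_ISPRINT) && cps[i] >= 0x20 && cps[i] <= 0x7E)
            w = snprintf(piece, sizeof piece, "%c", (int)cps[i]);
        else
            w = snprintf(piece, sizeof piece, "\\x{%lx}", cps[i]);

        if (w < 0 || used + (size_t)w >= outsz)
            break;                       /* stop rather than overflow */
        memcpy(out + used, piece, (size_t)w + 1);
        used += (size_t)w;
    }
    return used;
}
```

Fed the code points for *, m and è, the flagless call yields \x{2a}\x{6d}\x{e8}, while the call with the flag yields *m\x{e8}, which matches the shape of the two warnings quoted above.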

> 
> Second, I need a change like this in test.pl:
> diff --git a/t/test.pl b/t/test.pl
> index 99d77cc..d9b9432 100644
> --- a/t/test.pl
> +++ b/t/test.pl
> @@ -750,7 +750,9 @@ sub _fresh_perl {
>      $runperl_args->{progfile} = $tmpfile;
>      $runperl_args->{stderr} = 1;
> 
> -    open TEST, ">$tmpfile" or die "Cannot open $tmpfile: $!";
> +    my $mode = $prog =~ /\P{ASCII}/ ? '>:utf8' : '>';
> +
> +    open TEST, $mode, $tmpfile or die "Cannot open $tmpfile: $!";
> 
>      # VMS adjustments
>      if( $is_vms ) {
> 
> (Though I suppose that \p{} might be a bit too much for test.pl. Maybe utf8::is_utf8($prog) will make do? Or a mode param?)
> Essentially, I require some way to say that it should write the program with one layer or another, otherwise, I'll have to split all the fresh_perl tests with UTF-8 into different files, and I dunno what the policy for changes to test.pl is.

That patch you are suggesting is the wrong way to do it. It means that a string containing "\xFF" won’t be encoded as UTF-8, but add one \x{100}, and the \xFF is suddenly encoded differently. While it may be OK for perl’s own tests (it not being a public API), I think it will lead to badly written tests later on.

You could add a new function or a new argument to the appropriate function in test.pl. But to keep things simpler it might be better to call utf8::encode on the program before passing it to test.pl. That way people reading the tests can see exactly what’s happening.
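The inconsistency is easier to see with the bytes written out. The following is a rough model in C rather than Perl, and every name in it is hypothetical: sketch_encode_guessing picks the encoding from the content (one byte per character unless something above 0xFF is present), while sketch_encode_explicit always produces UTF-8, which is roughly what an explicit utf8::encode in the test would give you. The caller is assumed to supply an output buffer of at least 4 bytes per code point.

```c
#include <assert.h>
#include <stddef.h>

/* Write one code point (assumed <= U+10FFFF) as UTF-8; returns bytes written. */
static size_t sketch_put_utf8(unsigned char *out, unsigned long cp)
{
    assert(cp <= 0x10FFFFUL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Explicit choice: every character is UTF-8 encoded, whatever the content. */
static size_t sketch_encode_explicit(unsigned char *out,
                                     const unsigned long *cps, size_t n)
{
    size_t used = 0, i;
    for (i = 0; i < n; i++)
        used += sketch_put_utf8(out + used, cps[i]);
    return used;
}

/* Content-guessing choice: raw single bytes if everything fits in 0xFF,
 * UTF-8 for the whole string otherwise. */
static size_t sketch_encode_guessing(unsigned char *out,
                                     const unsigned long *cps, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (cps[i] > 0xFF)
            return sketch_encode_explicit(out, cps, n);
    for (i = 0; i < n; i++)
        out[i] = (unsigned char)cps[i];
    return n;
}
```

Under the guessing variant, U+00FF on its own comes out as the single byte FF, but U+00FF followed by U+0100 comes out as C3 BF C4 80, so the same character changes bytes depending on its neighbours. The explicit variant gives C3 BF for U+00FF in both cases.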

> 
> That's about it I think. This week (shortened as it is due to the lateness of this report) I'll be finishing the last TODOs, keep on writing tests, and start reviewing things for a preliminary version. If we can reach a consensus on the SvUTF8() & flag passing issues

Concerning the SvUTF8 flag on GVs (that’s what you mean, isn’t it?), I just had a look in gv.h, and I don’t see any UTF8 flag that goes in GvFLAGS. (Based on what you said earlier, I had assumed there was, without even looking.) Am I missing something?

If I’m correct, then you can use SvUTF8(gv) itself to store the UTF8 flag and your problem is solved. Any time a GV is copied, however, you need to make sure that flag is copied too, just as for strings.


> (the latter, I suppose, could be left as it is until I get to toke.c), then aiming for a "cleanup done, review at will" mail next week isn't entirely insane. /motivation
> 

