
Re: First class regexps

From: Nicholas Clark
Date: January 2, 2008 14:18
Subject: Re: First class regexps
Message ID: 20080102221846.GO23703@plum.flirble.org

On Fri, Dec 28, 2007 at 01:10:07PM +0100, demerphq wrote:
> On 28/12/2007, Nicholas Clark <nick@ccl4.org> wrote:
> > So, as of change 32571, 5.11 now has first class regexps

> > So I was thinking what should go where. And whether actually the internal
> > regexp structure should merge. For starters, there is this at the end:
> >
> > typedef struct regexp {
> >         ...
> >         /* Refcount of this regexp */
> >         I32 refcnt;             /* Refcount of this regexp */
> > } regexp;
> >
> 
> Yes i think it would be nice to remove refcounting from the regexp
> struct and use the refcount slot from the sv.

Right. As of change 32804 this is done:

Change 32804 by nicholas@nicholas-bouvard on 2008/01/02 13:47:42

	Make struct regexp the body of SVt_REGEXP SVs, REGEXPs become SVs,
	and regexp reference counting is via the regular SV reference counting.
	This was not as easy as it looks.

(http://public.activestate.com/cgi-bin/perlbrowse/p/32804 for the full details)

Fate attempted to conspire against completion, because when I was fairly close
last night the picture on my monitor folded down into a bright spot and
vanished. I believe that my monitor is now pushing up daisies :-(

It was 11 and had a good life. Unfortunately attempts to source a stopgap
monitor* have only been partially successful, as my friend's spare monitor
can't cope with the resolution/refresh rate I had set up on my machine. I
have another friend...


Right now dumping them isn't very interesting:

$ ./perl -Ilib -MDevel::Peek -e 'Dump qr/Pie/'
SV = IV(0x811a1bc) at 0x811a1c0
  REFCNT = 1
  FLAGS = (TEMP,ROK)
  RV = 0x811a260
  SV = REGEXP(0x81396c0) at 0x811a260
    REFCNT = 2
    FLAGS = ()
    IV = 0
    NV = 0
    PV = 0

The REFCNT is 2 because a PMOP also owns a reference.
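To spell out the ownership picture behind that number, here is a toy C model of shared reference counting. It is only a sketch: toy_sv, toy_new, toy_inc, toy_dec and toy_live are names invented for this illustration, not perl's real SV API. The object is freed only when the last owner lets go, which is why a regexp held by both a reference and an op reports a count of 2.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted SV (not perl's real struct). */
typedef struct toy_sv {
    unsigned refcnt;
} toy_sv;

/* Number of toy_sv objects currently allocated, so frees can be observed. */
static int toy_live = 0;

/* Create an object that starts out with exactly one owner. */
static toy_sv *toy_new(void)
{
    toy_sv *sv = malloc(sizeof *sv);
    if (!sv)
        abort();
    sv->refcnt = 1;
    toy_live++;
    return sv;
}

/* A second owner takes a reference. */
static toy_sv *toy_inc(toy_sv *sv)
{
    sv->refcnt++;
    return sv;
}

/* An owner drops its reference; the last one out frees the object. */
static void toy_dec(toy_sv *sv)
{
    assert(sv->refcnt > 0);
    if (--sv->refcnt == 0) {
        toy_live--;
        free(sv);
    }
}
```

In this model the op would call toy_new() and the reference would call toy_inc(), giving a count of 2; dropping the temporary reference leaves the op's copy alive.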

I'm not entirely sure what to do next. (Well, apart from having a break)


> Which means that stringification of a regex becomes almost exactly the
> same as the stringification of a PV (the only difference i can think
> of is how utf8 is encoded, but even that could hypothetically be done
> the same as PV's). Thus i can imagine this would result in a chunk of
> code just disappearing.

Yes, I think it would be good to use SvUTF8(regexp) to mean that C<wrapped>
is UTF-8. It simplifies things.

Code has already disappeared - because REGEXPs are now SVs, there is no
need to special case how they are stored in mg_obj and correctly reference
counted.

> It would be cool to map the more important fields like you describe.
> We just need to decide which. Some are much less important than others
> so the extra indirection is irrelevant.  I think for now pretending
> these new SV types are as similar to PV's as possible, except for the
> extra pointer, would be the best. Once they reuse as much code as
> possible we can figure out what to do with the unused slots.

I'm guessing that wrapped/wraplen goes into the PV, and the pattern-is-UTF8
flag into the UTF-8 flag bit. That leaves the "IV" and "NV" fields to use,
plus possibly the "private" SV flags bits 0xE0000000.

Possibly premature optimisation, but as wrapped is down in this section:

        /* Information about the match that isn't often used */
        /* wrapped can't be const char*, as it is returned by sv_2pv_flags */
        char *wrapped;          /* wrapped version of the pattern */
        I32 wraplen;            /* length of wrapped */
        unsigned pre_prefix:4;  /* offset from wrapped to the start of precomp */
        unsigned seen_evals:28; /* number of eval groups in the pattern - for security checks */
        HV *paren_names;        /* Optional hash of paren names */
} regexp;

maybe we should use IV and NV to store things from here?
In the SV structure, IV is in a union that can store a pointer, and NV is
in a union that also has two I32s. So perhaps paren_names goes in IV, but
the question is then which two 32 bit quantities in the NV slot.
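To make the slot question concrete, here is a toy C model of the two slots as described above: one union that can hold an integer or a pointer, and one that can hold a double or two 32-bit integers. Every name in it (toy_iv_slot, toy_nv_slot, toy_body, toy_pack, toy_store and so on) is invented for this sketch and is not perl's real layout. Putting wraplen and the packed pre_prefix/seen_evals word into the two 32-bit halves is only one possible answer, shown to illustrate the idea, and the second half is modelled as unsigned so that the packed word round-trips cleanly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical "IV" slot: an integer or a pointer share the storage. */
typedef union {
    intptr_t iv;
    void    *ptr;
} toy_iv_slot;

/* Hypothetical "NV" slot: a double or two 32-bit integers share the storage. */
typedef union {
    double nv;
    struct {
        int32_t  first;
        uint32_t second;
    } pair;
} toy_nv_slot;

typedef struct {
    toy_iv_slot iv_slot;
    toy_nv_slot nv_slot;
} toy_body;

/* Pack a 4-bit pre_prefix and a 28-bit seen_evals into one 32-bit word. */
static uint32_t toy_pack(unsigned pre_prefix, unsigned seen_evals)
{
    assert(pre_prefix <= 0xFu);
    assert(seen_evals <= 0x0FFFFFFFu);
    return ((uint32_t)pre_prefix << 28) | (uint32_t)seen_evals;
}

static unsigned toy_pre_prefix(uint32_t word) { return word >> 28; }
static unsigned toy_seen_evals(uint32_t word) { return word & 0x0FFFFFFFu; }

/* One candidate mapping: paren_names in the pointer slot, the rest in the pair. */
static void toy_store(toy_body *body, void *paren_names, int32_t wraplen,
                      unsigned pre_prefix, unsigned seen_evals)
{
    body->iv_slot.ptr = paren_names;
    body->nv_slot.pair.first = wraplen;
    body->nv_slot.pair.second = toy_pack(pre_prefix, seen_evals);
}
```

With this mapping only wrapped itself would still need a home, which matches the guess above that wrapped/wraplen could live in the PV.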


Also, I see this:

#ifdef USE_ITHREADS
#define PM_GETRE(o)     (INT2PTR(REGEXP*,SvIVX(PL_regex_pad[(o)->op_pmoffset])))
#define PM_SETRE(o,r)   STMT_START { \
                            SV* const sv = PL_regex_pad[(o)->op_pmoffset]; \
                            sv_setiv(sv, PTR2IV(r)); \
                        } STMT_END

As REGEXPs are now SVs, can they be stored directly in the PADs without these
games?
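The difference between the two schemes can be sketched with toy types. This is an illustration only: toy_regexp, toy_sv and the four accessor functions are made up here and do not correspond to the real PM_GETRE/PM_SETRE macros or to perl's pad implementation. The first pair mimics stashing the regexp's address as an integer inside a holder SV; the second pair stores the regexp in the slot directly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; not perl's real types. */
typedef struct toy_regexp { int id; } toy_regexp;
typedef struct toy_sv { intptr_t iv; } toy_sv;

/* Current style: the slot is a holder SV whose integer value is the
 * regexp's address, so every access converts integer <-> pointer. */
static void toy_setre_indirect(toy_sv **pad, int offset, toy_regexp *r)
{
    assert(offset >= 0);
    pad[offset]->iv = (intptr_t)r;
}

static toy_regexp *toy_getre_indirect(toy_sv **pad, int offset)
{
    assert(offset >= 0);
    return (toy_regexp *)pad[offset]->iv;
}

/* Possible alternative: the slot holds the regexp itself, no conversion. */
static void toy_setre_direct(toy_regexp **pad, int offset, toy_regexp *r)
{
    assert(offset >= 0);
    pad[offset] = r;
}

static toy_regexp *toy_getre_direct(toy_regexp **pad, int offset)
{
    assert(offset >= 0);
    return pad[offset];
}
```

Both variants return the same pointer; the direct one simply drops the holder object and the casts.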


Did I choose the best names for the accessor macros - RX_* and RXp_* ?

And we don't seem to have progressed on the bikeshed argument. I don't like
PLUM because it doesn't start with R, and it's not ORANGE. As to serious
suggestions I actually don't like REGEXP or REGEX because they are easily
confused with each other and with the current "Regexp". PATTERN doesn't start
with an R, but RULE isn't as accurate (but is less typing). QR, RE and RX
all suggested themselves to me. And I can't think of any single word colours
starting with R. :-)


Nicholas Clark

*  (until I can go properly shopping, once I know what dot pitch I want
    on a flat panel to match the not-yet acquired laptop)
