Re: [perl #33185] UTF-8 string substitution corrupts memory

Nicholas Clark
December 30, 2004 10:57
On Sat, Dec 25, 2004 at 09:50:46PM -0000, sroy @ search-box. com wrote:

> The breakpoint stops the first time perl needs to check whether a
> utf8 character is part of a string class.  At this point (step #5) everything
> is ok.  By step #6 the value of PL_bostr (my_perl->Tbostr) is corrupted.
> To see more details, instead of c at step #6 do:
> 6. fin
> 7. s 4
> Now the debugger is sitting at the line that corrupts prog->startp.
> Ultimately, this corruption leads to a seg fault at pp_hot.c:2151 when perl
> tries to copy characters as part of the s/// operation.

I can recreate this on OS X when running with the perl debugger. I can't
recreate it on FreeBSD (on a box where valgrind has been installed) and
annoyingly the x86 Linux box I usually use for this sort of thing is
currently inaccessible.

> In the middle of processing the regular expression, The regex library
> demand-loads a bunch of stuff to create the swashes for the [:print:]
> expression.  At the end of all that PL_bostr has a completely new value.
> I have no idea whether the right fix is to move away from using PL_bostr
> in the regex library in favor of some local variable, or to try and
> save PL_bostr and restore it before any line that might change it.

Thanks for the analysis, which seems to be spot on. (Seems, because I'm no
expert on the regexp engine's guts.)

Ideally we'd really like to re-write the regexp engine sufficiently to remove
all the global state, and hence make it totally re-entrant. Currently no-one
with the expertise to do this has the time.
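
To illustrate why the global state matters, here is a minimal sketch in C. It
is not code from regexec.c: g_start, match_global, match_ctx and
match_reentrant are hypothetical names invented for this example, with g_start
standing in for an engine global such as PL_bostr. The first function keeps
the start of the string in a global, so a nested match overwrites it. The
second keeps it in a caller-supplied context, so the nested match cannot touch
it.

```c
/* Hypothetical stand-in for an engine global such as PL_bostr. */
const char *g_start;

/* Non-re-entrant: the start-of-string pointer lives in a global.
 * Returns 1 if the global still points at s when the match finishes. */
int match_global(const char *s, int nested)
{
    g_start = s;
    if (nested) {
        /* Simulates the engine re-entering itself, e.g. while loading tables. */
        (void)match_global("table-loading input", 0);
    }
    return g_start == s;    /* 0 when the nested call clobbered the global */
}

/* Re-entrant: each match keeps its state in its own context. */
typedef struct {
    const char *start;
} match_ctx;

int match_reentrant(match_ctx *ctx, const char *s, int nested)
{
    ctx->start = s;
    if (nested) {
        match_ctx inner;    /* the nested match gets separate state */
        (void)match_reentrant(&inner, "table-loading input", 0);
    }
    return ctx->start == s; /* the outer state is untouched */
}
```

With nested set to 1, match_global reports that its pointer was clobbered,
while match_reentrant does not.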

In the meantime there are kludges that save enough state to make the utf8
initialisation work, at least in theory:

/* XXX Here's a total kludge.  But we need to re-enter for swash routines. */

    SAVEI32(PL_reg_flags);		/* from regexec.c */
    SAVEPPTR(PL_reginput);		/* String-input pointer. */

but what doesn't make sense to me is why PL_bostr isn't being saved (or
maybe isn't being restored) via the code path that your code takes.

The realistic fix is going to be to make it save and restore correctly for
the class of operations that your code represents. I don't have the
experience with the regexp engine to know where to look to quickly find the
correct solution, but I believe that several other people on the
perl5-porters mailing list do.
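
The shape of such a fix can be sketched in plain C as follows. This is an
assumption-laden illustration, not a patch: g_bostr_demo, load_swash_demo,
match_unsaved and match_saved are hypothetical names, and the explicit local
copy only mimics what a SAVEPPTR-style save followed by a scope exit would do
in the real engine.

```c
/* Hypothetical stand-in for PL_bostr. */
const char *g_bostr_demo;

/* Hypothetical stand-in for the demand-loading that re-enters the engine
 * and leaves the global pointing somewhere else. */
void load_swash_demo(void)
{
    g_bostr_demo = "swash loader input";
}

/* Without a save: the global is wrong after the re-entrant call.
 * Returns 1 if the global still points at s. */
int match_unsaved(const char *s)
{
    g_bostr_demo = s;
    load_swash_demo();
    return g_bostr_demo == s;
}

/* With a save and restore around the re-entrant call. */
int match_saved(const char *s)
{
    g_bostr_demo = s;

    const char *saved = g_bostr_demo;  /* save before re-entering */
    load_swash_demo();
    g_bostr_demo = saved;              /* restore afterwards */

    return g_bostr_demo == s;
}
```

The hard part in the real engine is not the save itself but finding every
code path that can re-enter, which is why the save has to cover the whole
class of operations rather than one call site.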

Nicholas Clark
