develooper Front page | perl.perl5.porters | Postings from January 2004

[perl #24846] [PATCH] Apparent utf8 bug in join() in 5.8.[012]

Thread Next
Jesse Vincent
January 8, 2004 20:59
[perl #24846] [PATCH] Apparent utf8 bug in join() in 5.8.[012]
Message ID:
# New Ticket Created by  Jesse Vincent 
# Please include the string:  [perl #24846]
# in the subject line of all future correspondence about this issue. 
# <URL: >

Yeah, I'm still not quite sure I believe it myself, but IO::Scalar
exercises join with UTF8 and non-UTF8 data causing RT to end up with
corrupted attachments fairly often. After patching IO::Scalar to work
around this by emulating join using concatenation, the issue disappears.


From: Nicholas Adrian Vinen <>                                                                 

      I am a consultant for a company which uses RT for their internal
      support. They asked me to fix a problem they were having where
attaching binary files to a ticket caused the file to become corrupt
sometimes. They tracked it down to the case where the mod_perl
session which serves the request to add the attachment to the ticket
has previously been used to perform some ticket-related operation. I
finally tracked down this problem to a bug in perl. Here is a detailed
description of the problem:

      When you attach a file to a ticket using RT it saves the file you
      attach into a file into /tmp. It then adds a MIME::Body::File
record to the MIME::Entity which represents the ticket. Later,
it calls make_singlepart() on the MIME::Entity, which converts the
entity into a string. During this process, it calls as_string() on the
MIME::Body::File. This causes the file to be read in and printed into a
string using the IO::Scalar object. IO::Scalar's print() function calls
the function join() on the data as it is read in, before that data is
appended onto the destination string.

      The problem occurs inside join(). join() recycles string objects
      into which it does the joining, which it later returns. It never
touches the UTF8 flag on these strings. So, on the initial run, it has
no strings to recycle (or few), and when they are created they are set
to ASCII. So all the results of join() are ASCII, which is what MIME and
RT wants, as ASCII is also what is used for processing binary data. The
problem is, on the second and subsequent executions of RT within the perl
system, the recycled strings often have the UTF8 flag set. So, join ('',
$string), where $string is ASCII, will often return a UTF8 string. When
this UTF8 string is later converted into ASCII it is modified, and so
the binary data is corrupted.

      The solution is to apply the following patch to perl (tested with
      perl 5.8.2), which sets the UTF8 flag on the returned string to
something sensible.

diff -u perl-5.8.2/doop.c perl-5.8.2-patched/doop.c
--- perl-5.8.2/doop.c   2003-09-30 10:09:51.000000000 -0700
+++ perl-5.8.2-patched/doop.c   2004-01-05 23:23:13.000000000 -0800
@@ -647,6 +647,9 @@
     register STRLEN len;
     STRLEN delimlen;
     STRLEN tmplen;
+    int utf8;
+    utf8 = (SvUTF8(del)!=0);
     (void) SvPV(del, delimlen); /* stringify and get the delimlen */
     /* SvCUR assumes it's SvPOK() and woe betide you if it's not. */
@@ -674,22 +677,37 @@
     if (items-- > 0) {
-       if (*mark)
+       if (*mark) {
+           utf8 += (SvUTF8(*mark)!=0);
            sv_catsv(sv, *mark);
+       }
     if (delimlen) {
        for (; items > 0; items--,mark++) {
+           utf8 += (SvUTF8(*mark)!=0);
     else {
-       for (; items > 0; items--,mark++)
+       for (; items > 0; items--,mark++) {
+           utf8 += (SvUTF8(*mark)!=0);
+       }
+    if( utf8 )
+    {
+        if( utf8 != sp-oldmark+1 && ckWARN_d(WARN_UTF8) )
+       {
+           Perl_warner(aTHX_ packWARN(WARN_UTF8), "Joining UTF8 and ASCII strings");
+       }
+        SvUTF8_on(sv);
+    } else {
+        SvUTF8_off(sv);
+    }

      There may be other perl functions with similar problems; this is
      beyond the scope of my job, however I hope that the maintainers of
perl will be proactive in attempting to find and fix any similar problems,
as the way they have added UTF8 support to perl doesn't make it obvious
when such bugs exist. I'd say that any built-in function that returns
a string should be checked for (a) setting the UTF8 flag at all and (b)
whether the value it sets it to is sensible. Also I think warnings when
mixed types of strings are passed into functions are sensible as this can
be dangerous, and as we don't know what character set the ASCII strings
are in, the routines themselves can't really handle this case properly
if any extended characters are present.

      I hope this helps.


On Tue, Jan 06, 2004 at 01:46:22PM -0500, Jesse Vincent wrote:
> Hey Nicholas,
> 	Thanks very much for the patch. I've forwarded it on to some
> perl 5 porters who pushed back a bit. They've asked for a clear
> statement of _exactly_ what's being recycled, along with a simple
> testcase for reproducing the bug.   I'd be happy to contribute myself,
> but I'm really not a C person :/

    I wish I knew! I had a really hard time reading the perl code. Here
is what I could tell: somehow calling join() in perl causes Perl_do_join
to be called with 3 main arguments. One is the delimiter, on is the
array of strings to be joined, and one is a string into which to put the
result. I don't know where exactly it gets the string which is the
destination - the code calls a macro which I chased several levels deep.
This is the function which does it, as far as I can tell:

    dSP; dMARK; dTARGET;
    do_join(TARG, *MARK, MARK, SP);
    SP = MARK;

   PP() looks like this:

#define PP OP * Perl_##s(pTHX)

   pTHX looks like this:

#define pTHX register struct perl_thread *thr PERL_UNUSED_DECL

   TARG is the string which is being joined into. dTARGET is a macro
which looks like this:


   GETTARGET looks like this:

#define GETTARGET targ = PAD_SV(PL_op->op_targ)

   PAD_SV looks like this:

#define PAD_SV(po) (PL_curpad[po])

   PL_curpad looks like this:

#define PL_curpad (*Perl_Tcurpad_ptr(aTHX))

   aTHX is defined like this:

#define aTHX thr

   I think you can see why I wasn't very specific :(  it's a mess...
and here is where I lose the trail because I can't find
Perl_Tcurpad_ptr defined anywere. However, here is what I can tell. The
target comes out of the some 'curpad' member of 'thr' which is the
current perl thread context. I *think* curpad is like a stack and the
returned value from join goes onto the end of the stack. I think it is
these values which are being recycled. Certainly the target has to come
from somewhere, and it doesn't look like it is being allocated, it looks
to me like it's being taken from an array. op_targ seems to be which
element of the array that the result should go into but I'm not sure
what defines this. I'd have to dig a lot more to find out and I'm
already quite lost.

   The simple test case is this: install RT3 on a server with a single
process with MaxRequestsPerChildren>1, using mod_perl, and attach a
binary file to a ticket twice in a row :)   That's the simplest test
case I have. I've tried to reproduce it in small scripts but I can't,
and now that I've fixed it on the servers that I have access to, I don't
really want to break it again just to write a test case. I think,
however, the fact that the join function never sets or unsets the utf8
flag on its target string means that it can't be operating 100%
correctly. Anyway, this is as much information as I can give you as I
need to get on with the next project now.


P.S. I tried to cross-post the email I sent onto the rt-devel and
rt-users mailing lists but was rejected because I'm not subscribed to
them. Perhaps you can do that for me?

--  -- Trouble Ticketing. Free.

Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About