develooper Front page | perl.perl5.porters | Postings from January 2005

Re: [perl #33734] unpack fails on utf-8 strings

Thread Previous | Thread Next
From:
perl5-porters
Date:
January 22, 2005 13:34
Subject:
Re: [perl #33734] unpack fails on utf-8 strings
Message ID:
csseq0$77a$1@post.home.lunix
Here is a patch that makes unpack completely independent from whether
the string happens to be upgraded or not.

Things to note:
  - The trick of using unpack("C*", $string) to see "through" the encoding
    doesn't work anymore 
    (which is as it should be if unpack is encoding-neutral)
    You can run unpack under "use bytes" to get the old effect though.

  - The meaning of U0 and C0 were sort of swapped for unpack. Consider:
     with current perl:
      perl -wle 'print for unpack("U0U*", "\341\277\274")'
      8188
      perl -wle 'print for unpack("C0U*", "\341\277\274")'
      225
      191
      188

    However, if you look at CONSTRUCTING such strings:

      perl -wle 'use Devel::Peek; Dump(pack("U0U*", 8188))'
      SV = PV(0x8162464) at 0x81734b8
      REFCNT = 1
      FLAGS = (PADTMP,POK,pPOK,UTF8)
      PV = 0x816b760 "\341\277\274"\0 [UTF8 "\x{1ffc}"]
      CUR = 3
      LEN = 14
       => wrong string (right bytes internally but has utf8 on)

      perl -wle 'use Devel::Peek; Dump(pack("C0U*", 225, 191, 188))'
      SV = PV(0x8162464) at 0x81734d0
      REFCNT = 1
      FLAGS = (PADTMP,POK,pPOK)
      PV = 0x817b320 "\303\241\302\277\302\274"\0
      CUR = 6
      LEN = 18

       => completely wrong string

    However, if you switch the C0 and U0 around:
     perl -wle 'use Devel::Peek; Dump(pack("U0U*", 225, 191, 188))'
     SV = PV(0x8162464) at 0x81734d0
     REFCNT = 1
     FLAGS = (PADTMP,POK,pPOK,UTF8)
     PV = 0x817b320 "\303\241\302\277\302\274"\0 [UTF8 "\x{e1}\x{bf}\x{bc}"]
     CUR = 6
     LEN = 18
   
     THAT is the string "\341\277\274" (albeit encoded as UTF8, but I'm
     working towards an encoding neutral unpack)

     perl -wle 'use Devel::Peek; Dump(pack("C0U*", 8188))'
     SV = PV(0x8162464) at 0x81734b8
     REFCNT = 1
     FLAGS = (PADTMP,POK,pPOK)
     PV = 0x816b760 "\341\277\274"\0
     CUR = 3
     LEN = 14

     ALSO the string "\341\277\274"

     Since the pack behaviour looks right and is the one documented, I 
     conclude that the semantics for C0 and U0 in unpack are wrong (reversed).
     With the patch they behave like what I think is right:

     ./perl -Ilib -wle 'print for unpack("C0U*", "\341\277\274")'
     8188
     ./perl -Ilib -wle 'print for unpack("U0U*", "\341\277\274")'
     225
     191
     188

    Funny enough I only had to change one test for this 
    (perl-dev/t/uni/case.pl), which is not even a core pack test.

  - I dropped the "register" in several places. The utf8 helper functions
    take the address of s (the most important of these registers), so that one
    won't stay in a register anyways on naive compilers and the smart ones 
    don't need the hint.

    I also made all this little help variables more local, which I think
    reads easier and maybe allows the compiler a bit more leeway to choose
    good registers.

  - I made the checksum for the "C" format also distinguish on bits_in_uv

  - For the moment I made unpack "C" on a char >= 256 be an error, though
    later on I'd like to make it just basically do ord(). C is however
    the most likely format that users will notice the changed semantics, so 
    for a first version it might be good to keep it like this so users can
    flush out errors. (since A, a and Z checksumming is currently delegated
    to "C", that also implies that these won't checksum chars >= 256)

  - Fixed a minor bug in that checksumming over a partial byte in 'B' formats
    forgot to advance the pointer (needs to be applied to maint too)

  - U0 or starting with U on an octet string is completely yucky to implement
    as a special case in all formats, so I do these by upgrading a temporary 
    string. This is inefficient if you only unpack few things from a huge
    string (fortunately this case shouldn't be normal. pack with an initial
    U would return an already upgraded string, so this would only happen if
    the string later got somehow degraded). Still might be worth warning about
    in the docs.

  - a,A,Z now conserve unicodeness when extracting from an unicode packed 
    string (can't happen (currently) if the string was constructed with
    pack).
    e.g.:

    ./perl -Ilib -wle 'use Devel::Peek; Dump(unpack("a*", pack("U*", 8188)))'
    SV = PV(0x8162ef8) at 0x8162b00
    REFCNT = 1
    FLAGS = (TEMP,POK,pPOK,UTF8)
    PV = 0x8167bf0 "\341\277\274"\0 [UTF8 "\x{1ffc}"]
    CUR = 3
    LEN = 4

    It could be argued that I should try to do sv_utf8_downgrade(sv, 1)
    before returning, but I prefer the current code.

  - There's a few code formatting changes I was too lazy to filter out again

  - Relatively few tests and code in perl itself need to be updated.
    The ugliest one is in lib/CGI.Util,pm (what Util.pm is doing there
    is utterly broken anyways, I left it for compatibility. However, what
    it really should do is utf8::downgrade or use %u escapes I think)

Probably this has too big an impact to apply to stable (though I think that
all places where it makes a difference are bugs really), but I think it would
be proper for blead.

The patch is only half the work. There's also companion work that has to be
done for pack() on a, A and Z formats (if their argument is utf-8). But let's 
first see how acceptance of this one goes before I tackle that...
This is the most important and urgent part anyways.

diff -rubB perl-clean/ext/Encode/lib/Encode/MIME/Header.pm perl-dev/ext/Encode/lib/Encode/MIME/Header.pm
--- perl-clean/ext/Encode/lib/Encode/MIME/Header.pm	Thu May 20 15:51:06 2004
+++ perl-dev/ext/Encode/lib/Encode/MIME/Header.pm	Fri Jan 21 23:08:12 2005
@@ -166,7 +166,7 @@
     $chunk =~ s{
 		([^0-9A-Za-z])
 	       }{
-		   join("" => map {sprintf "=%02X", $_} unpack("C*", $1))
+		   join("" => map {sprintf "=%02X", $_} unpack("U0C*", $1))
 	       }egox;
     return decode_utf8(HEAD . 'Q?' . $chunk . TAIL);
 }
diff -rubB perl-clean/ext/Encode/t/encoding.t perl-dev/ext/Encode/t/encoding.t
--- perl-clean/ext/Encode/t/encoding.t	Thu Apr 18 16:12:01 2002
+++ perl-dev/ext/Encode/t/encoding.t	Fri Jan 21 22:56:33 2005
@@ -57,7 +57,7 @@
 print "ok 8\n";
 
 # the first octet of UTF-8 encoded 0x3af 
-print "not " unless unpack("C", chr(0xdf)) == 0xce;
+print "not " unless unpack("U0C", chr(0xdf)) == 0xce;
 print "ok 9\n";
 
 print "not " unless unpack("U", pack("U", 0xdf)) == 0xdf;
diff -rubB perl-clean/ext/Storable/t/utf8hash.t perl-dev/ext/Storable/t/utf8hash.t
--- perl-clean/ext/Storable/t/utf8hash.t	Fri Sep  5 22:02:53 2003
+++ perl-dev/ext/Storable/t/utf8hash.t	Fri Jan 21 23:21:00 2005
@@ -57,12 +57,12 @@
 foreach my $i (@ords){
     my $u = chr($i); utf8::upgrade($u);
     # warn sprintf "%d,%d", bytes::length($u), is_utf8($u);
-    my $b = pack("C*", unpack("C*", $u));
+    my $b = pack("C*", unpack("U0C*", $u));
     # warn sprintf "%d,%d" ,bytes::length($b), is_utf8($b);
 
     isnt($u,	                        $b, 
 	 "equivalence - with utf8flag");
-    is   (pack("C*", unpack("C*", $u)), pack("C*", unpack("C*", $b)),
+    is   ($u, pack("U0C*", unpack("C*", $b)),
 	  "equivalence - without utf8flag");
 
     $utf8hash{$u} = $utf8hash{$b} = $i;
diff -rubB perl-clean/ext/Unicode/Normalize/t/short.t perl-dev/ext/Unicode/Normalize/t/short.t
--- perl-clean/ext/Unicode/Normalize/t/short.t	Tue Jun  8 21:07:22 2004
+++ perl-dev/ext/Unicode/Normalize/t/short.t	Fri Jan 21 23:25:12 2005
@@ -35,7 +35,7 @@
 no warnings qw(utf8);
 
 # U+3042 is 3-byte length (in UTF-8/UTF-EBCDIC)
-our $a = pack 'U0C', unpack 'C', "\x{3042}";
+our $a = pack 'U0C', unpack 'U0C', "\x{3042}";
 
 print NFD($a) eq "\0"
    ? "ok" : "not ok", " 2\n";
diff -rubB perl-clean/lib/CGI/Util.pm perl-dev/lib/CGI/Util.pm
--- perl-clean/lib/CGI/Util.pm	Wed Jun  9 10:33:10 2004
+++ perl-dev/lib/CGI/Util.pm	Sat Jan 22 00:10:02 2005
@@ -197,10 +197,9 @@
 # URL-encode data
 sub escape {
   shift() if @_ > 1 and ( ref($_[0]) || (defined $_[1] && $_[0] eq $CGI::DefaultClass));
-  my $toencode = shift;
-  return undef unless defined($toencode);
-  # force bytes while preserving backward compatibility -- dankogai
-  $toencode = pack("C*", unpack("C*", $toencode));
+  return undef unless defined($_[0]);
+  # force bytes while preserving backward compatibility
+  my $toencode = do { use bytes; "" . shift };
     if ($EBCDIC) {
       $toencode=~s/([^a-zA-Z0-9_.-])/uc sprintf("%%%02x",$E2A[ord($1)])/eg;
     } else {
diff -rubB perl-clean/lib/Encode/MIME/Header.pm perl-dev/lib/Encode/MIME/Header.pm
--- perl-clean/lib/Encode/MIME/Header.pm	Thu May 20 15:51:06 2004
+++ perl-dev/lib/Encode/MIME/Header.pm	Fri Jan 21 23:08:12 2005
@@ -166,7 +166,7 @@
     $chunk =~ s{
 		([^0-9A-Za-z])
 	       }{
-		   join("" => map {sprintf "=%02X", $_} unpack("C*", $1))
+		   join("" => map {sprintf "=%02X", $_} unpack("U0C*", $1))
 	       }egox;
     return decode_utf8(HEAD . 'Q?' . $chunk . TAIL);
 }
diff -rubB perl-clean/pp_pack.c perl-dev/pp_pack.c
--- perl-clean/pp_pack.c	Fri Jan 14 13:32:50 2005
+++ perl-dev/pp_pack.c	Sat Jan 22 02:14:33 2005
@@ -57,8 +57,8 @@
 
 #if U16SIZE > SIZE16 || U32SIZE > SIZE32
 #  if BYTEORDER == 0x1234 || BYTEORDER == 0x12345678    /* little-endian */
-#    define OFF16(p)	(char*)(p)
-#    define OFF32(p)	(char*)(p)
+#    define OFF16(p)	((char*)(p))
+#    define OFF32(p)	((char*)(p))
 #  else
 #    if BYTEORDER == 0x4321 || BYTEORDER == 0x87654321  /* big-endian */
 #      define OFF16(p)	((char*)(p) + (sizeof(U16) - SIZE16))
@@ -67,25 +67,38 @@
        }}}} bad cray byte order
 #    endif
 #  endif
-#  define COPY16(s,p)  (*(p) = 0, Copy(s, OFF16(p), SIZE16, char))
-#  define COPY32(s,p)  (*(p) = 0, Copy(s, OFF32(p), SIZE32, char))
-#  define COPYNN(s,p,n) (*(p) = 0, Copy(s, (char *)(p), n, char))
-#  define CAT16(sv,p)  sv_catpvn(sv, OFF16(p), SIZE16)
-#  define CAT32(sv,p)  sv_catpvn(sv, OFF32(p), SIZE32)
-#else
-#  define COPY16(s,p)  Copy(s, p, SIZE16, char)
-#  define COPY32(s,p)  Copy(s, p, SIZE32, char)
-#  define COPYNN(s,p,n) Copy(s, (char *)(p), n, char)
-#  define CAT16(sv,p)  sv_catpvn(sv, (char*)(p), SIZE16)
-#  define CAT32(sv,p)  sv_catpvn(sv, (char*)(p), SIZE32)
+#else
+#  define OFF16(p)     ((char *) (p))
+#  define OFF32(p)     ((char *) (p))
 #endif
 
+#define COPY16(s,p)  Copy(s, OFF16(p), SIZE16, char)
+#define COPY32(s,p)  Copy(s, OFF32(p), SIZE32, char)
+#define CAT16(sv,p)  sv_catpvn(sv, OFF16(p), SIZE16)
+#define CAT32(sv,p)  sv_catpvn(sv, OFF32(p), SIZE32)
+
+/* Only to be used inside a loop (see the break) */
+#define COPYVAR(s,strend,utf8,var,format)		\
+STMT_START {						\
+    if (utf8) {						\
+        if (!next_uni_bytes(aTHX_ &s, strend,		\
+            (char *) &var, sizeof(var))) break;		\
+    } else {						\
+        Copy(s, (char *) &var, sizeof(var), char);	\
+        s += sizeof(var);				\
+    }							\
+    DO_BO_UNPACK(var, format);				\
+} STMT_END
+
+
 /* Avoid stack overflow due to pathological templates. 100 should be plenty. */
 #define MAX_SUB_TEMPLATE_LEVEL 100
 
 /* flags (note that type modifiers can also be used as flags!) */
+#define FLAG_UNPACK_WAS_UTF8    0x40	/* original had FLAG_UNPACK_DO_UTF8 */
+#define FLAG_UNPACK_PARSE_UTF8  0x20	/* Parse as utf8 */
 #define FLAG_UNPACK_ONLY_ONE  0x10
-#define FLAG_UNPACK_DO_UTF8   0x08
+#define FLAG_UNPACK_DO_UTF8     0x08	/* The underlying string is utf8 */
 #define FLAG_SLASH            0x04
 #define FLAG_COMMA            0x02
 #define FLAG_PACK             0x01
@@ -241,24 +254,23 @@
 
 /* Returns the sizeof() struct described by pat */
 STATIC I32
-S_measure_struct(pTHX_ register tempsym_t* symptr)
+S_measure_struct(pTHX_ tempsym_t* symptr)
 {
-    register I32 len = 0;
-    register I32 total = 0;
+    I32 len = 0;
+    I32 total = 0;
     int star;
 
-    register int size;
+    int size;
 
     while (next_symbol(symptr)) {
-
-        switch( symptr->howlen ){
+        switch (symptr->howlen) {
         case e_no_len:
 	case e_number:
 	    len = symptr->length;
 	    break;
         case e_star:
    	    Perl_croak(aTHX_ "Within []-length '*' not allowed in %s",
-                       symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+                       symptr->flags & FLAG_PACK ? "pack" : "unpack");
             break;
         }
 
@@ -267,7 +279,7 @@
 	default:
             Perl_croak(aTHX_ "Invalid type '%c' in %s",
                        (int)TYPE_NO_MODIFIERS(symptr->code),
-                       symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+                       symptr->flags & FLAG_PACK ? "pack" : "unpack");
 	case '@':
 	case '/':
 	case 'U':			/* XXXX Is it correct? */
@@ -275,12 +287,11 @@
 	case 'u':
 	    Perl_croak(aTHX_ "Within []-length '%c' not allowed in %s",
                        (int)symptr->code,
-                       symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+                       symptr->flags & FLAG_PACK ? "pack" : "unpack");
 	case '%':
 	    size = 0;
 	    break;
-	case '(':
-	{
+	  case '(': {
             tempsym_t savsym = *symptr;
   	    symptr->patptr = savsym.grpbeg;
             symptr->patend = savsym.grpend;
@@ -302,7 +313,7 @@
 	    size = -1;
 	    if (total < len)
 		Perl_croak(aTHX_ "'X' outside of string in %s",
-                          symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+			   symptr->flags & FLAG_PACK ? "pack" : "unpack");
 	    break;
  	case 'x' | TYPE_IS_SHRIEKING:
  	    if (!len)			/* Avoid division by 0 */
@@ -432,7 +443,7 @@
  * returns char pointer to char after match, or NULL
  */
 STATIC char *
-S_group_end(pTHX_ register char *patptr, register char *patend, char ender)
+S_group_end(pTHX_ char *patptr, char *patend, char ender)
 {
     while (patptr < patend) {
 	char c = *patptr++;
@@ -461,7 +472,7 @@
  * Advances char pointer to 1st non-digit char and returns number
  */ 
 STATIC char *
-S_get_num(pTHX_ register char *patptr, I32 *lenptr )
+S_get_num(pTHX_ char *patptr, I32 *lenptr)
 {
   I32 len = *patptr++ - '0';
   while (isDIGIT(*patptr)) {
@@ -477,10 +488,10 @@
  * locates next template code and count
  */
 STATIC bool
-S_next_symbol(pTHX_ register tempsym_t* symptr )
+S_next_symbol(pTHX_ tempsym_t* symptr)
 {
-  register char* patptr = symptr->patptr; 
-  register char* patend = symptr->patend; 
+  char* patptr = symptr->patptr;
+  char* patend = symptr->patend;
 
   symptr->flags &= ~FLAG_SLASH;
 
@@ -498,26 +509,26 @@
       I32 code = *patptr++ & 0xFF;
       U32 inherited_modifiers = 0;
 
-      if (code == ','){ /* grandfather in commas but with a warning */
-	if (((symptr->flags & FLAG_COMMA) == 0) && ckWARN(WARN_UNPACK)){
+      if (code == ',') { /* grandfather in commas but with a warning */
+	if (!(symptr->flags & FLAG_COMMA) && ckWARN(WARN_UNPACK)) {
           symptr->flags |= FLAG_COMMA;
 	  Perl_warner(aTHX_ packWARN(WARN_UNPACK),
 	 	      "Invalid type ',' in %s",
-                      symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+                      symptr->flags & FLAG_PACK ? "pack" : "unpack");
         }
 	continue;
       }
       
       /* for '(', skip to ')' */
       if (code == '(') {  
-        if( isDIGIT(*patptr) || *patptr == '*' || *patptr == '[' )
+        if (isDIGIT(*patptr) || *patptr == '*' || *patptr == '[')
           Perl_croak(aTHX_ "()-group starts with a count in %s",
-                     symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+                     symptr->flags & FLAG_PACK ? "pack" : "unpack");
         symptr->grpbeg = patptr;
-        patptr = 1 + ( symptr->grpend = group_end(patptr, patend, ')') );
-        if( symptr->level >= MAX_SUB_TEMPLATE_LEVEL )
+        patptr = 1 + (symptr->grpend = group_end(patptr, patend, ')'));
+        if (symptr->level >= MAX_SUB_TEMPLATE_LEVEL)
 	  Perl_croak(aTHX_ "Too deeply nested ()-groups in %s",
-                     symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+                     symptr->flags & FLAG_PACK ? "pack" : "unpack");
       }
 
       /* look for group modifiers to inherit */
@@ -547,28 +558,28 @@
             break;
         }
 
-        if (modifier == 0)
-          break;
+	  if (!modifier) break;
 
         if (!strchr(allowed, TYPE_NO_MODIFIERS(code)))
-          Perl_croak(aTHX_ "'%c' allowed only after types %s in %s", *patptr,
-                     allowed, symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+	      Perl_croak(aTHX_ "'%c' allowed only after types %s in %s", 
+			 *patptr, allowed, 
+			 symptr->flags & FLAG_PACK ? "pack" : "unpack");
 
         if (TYPE_ENDIANNESS(code | modifier) == TYPE_ENDIANNESS_MASK)
           Perl_croak(aTHX_ "Can't use both '<' and '>' after type '%c' in %s",
                      (int) TYPE_NO_MODIFIERS(code),
-                     symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+			 symptr->flags & FLAG_PACK ? "pack" : "unpack");
         else if (TYPE_ENDIANNESS(code | modifier | inherited_modifiers) ==
                  TYPE_ENDIANNESS_MASK)
           Perl_croak(aTHX_ "Can't use '%c' in a group with different byte-order in %s",
-                     *patptr, symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+			 *patptr, symptr->flags & FLAG_PACK ? "pack" : "unpack");
 
         if (ckWARN(WARN_UNPACK)) {
           if (code & modifier)
 	    Perl_warner(aTHX_ packWARN(WARN_UNPACK),
                         "Duplicate modifier '%c' after '%c' in %s",
                         *patptr, (int) TYPE_NO_MODIFIERS(code),
-                        symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+			      symptr->flags & FLAG_PACK ? "pack" : "unpack");
         }
 
         code |= modifier;
@@ -581,9 +592,8 @@
       /* look for count and/or / */ 
       if (patptr < patend) {
 	if (isDIGIT(*patptr)) {
- 	  patptr = get_num( patptr, &symptr->length );
+ 	  patptr = get_num(patptr, &symptr->length);
           symptr->howlen = e_number;
-
         } else if (*patptr == '*') {
           patptr++;
           symptr->howlen = e_star;
@@ -591,11 +600,11 @@
         } else if (*patptr == '[') {
           char* lenptr = ++patptr;            
           symptr->howlen = e_number;
-          patptr = group_end( patptr, patend, ']' ) + 1;
+          patptr = group_end( patptr, patend, ']') + 1;
           /* what kind of [] is it? */
           if (isDIGIT(*lenptr)) {
-            lenptr = get_num( lenptr, &symptr->length );
-            if( *lenptr != ']' )
+            lenptr = get_num( lenptr, &symptr->length);
+            if (*lenptr != ']')
               Perl_croak(aTHX_ "Malformed integer in [] in %s",
                          symptr->flags & FLAG_PACK ? "pack" : "unpack");
           } else {
@@ -627,7 +636,7 @@
               if (patptr < patend &&
                   (isDIGIT(*patptr) || *patptr == '*' || *patptr == '['))
                 Perl_croak(aTHX_ "'/' does not take a repeat count in %s",
-                           symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+                           symptr->flags & FLAG_PACK ? "pack" : "unpack");
             }
             break;
 	  }
@@ -648,6 +657,33 @@
 }
 
 /*
+   There is no way to cleanly handle the case where we should process the 
+   string per octet in its upgraded form while it's really in downgraded form
+   (e.g. estimates like strend-s as an upper bound for the number of 
+   characters left wouldn't work). So if we foresee the need of this 
+   (pattern starts with U or contains U0), we want to work on the encoded 
+   version of the string. Users are advised to upgrade their pack string 
+   themselves if they need to do a lot of unpacks like this on it
+*/
+STATIC bool 
+need_utf8(const char *pat, const char *patend)
+{
+    if (pat >= patend) return FALSE;
+    if (*pat == 'U') return TRUE;
+    while (pat < patend) {
+	if (pat[0] == '#') {
+	    pat++;
+	    pat = memchr(pat, '\n', patend-pat);
+	    if (!pat) return FALSE;
+	} else if (pat[0] == 'U') {
+	    if (pat[1] == '0') return TRUE;
+	}
+	pat++;
+    }
+    return FALSE;
+}
+
+/*
 =for apidoc unpack_str
 
 The engine implementing unpack() Perl function. Note: parameters strbeg, new_s
@@ -656,14 +692,34 @@
 =cut */
 
 I32
-Perl_unpack_str(pTHX_ char *pat, register char *patend, register char *s, char *strbeg, char *strend, char **new_s, I32 ocnt, U32 flags)
+Perl_unpack_str(pTHX_ char *pat, char *patend, char *s, char *strbeg, char *strend, char **new_s, I32 ocnt, U32 flags)
 {
     tempsym_t sym = { 0 };
+
+    if (flags & FLAG_UNPACK_DO_UTF8) flags |= FLAG_UNPACK_WAS_UTF8;
+    else if (need_utf8(pat, patend)) {
+	/* We probably should try to avoid this in case a scalar context call
+	   wouldn't get to the "U0" */
+	STRLEN len = strend - s;
+	s = bytes_to_utf8(s, &len);
+	SAVEFREEPV(s);
+	strend = s + len;
+	flags |= FLAG_UNPACK_DO_UTF8;
+    }
+
+    if (pat < patend && *pat == 'U') {
+	/*
+	  if (!(flags & FLAG_UNPACK_DO_UTF8)) 
+	      Perl_croak(aTHX_ "U0 mode on a byte string");
+	  flags &= ~FLAG_UNPACK_PARSE_UTF8;
+	*/
+    } else if (flags & FLAG_UNPACK_DO_UTF8) flags |= FLAG_UNPACK_PARSE_UTF8;
+
     sym.patptr = pat;
     sym.patend = patend;
     sym.flags  = flags;
 
-    return unpack_rec(&sym, s, s, strend, NULL );
+    return unpack_rec(&sym, s, s, strend, NULL);
 }
 
 /*
@@ -676,77 +732,126 @@
 =cut */
 
 I32
-Perl_unpackstring(pTHX_ char *pat, register char *patend, register char *s, char *strend, U32 flags)
+Perl_unpackstring(pTHX_ char *pat, char *patend, char *s, char *strend, U32 flags)
 {
     tempsym_t sym = { 0 };
+
+    if (flags & FLAG_UNPACK_DO_UTF8) flags |= FLAG_UNPACK_WAS_UTF8;
+    else if (need_utf8(pat, patend)) {
+	/* We probably should try to avoid this in case a scalar context call
+	   wouldn't get to the "U0" */
+	STRLEN len = strend - s;
+	s = bytes_to_utf8(s, &len);
+	SAVEFREEPV(s);
+	strend = s + len;
+	flags |= FLAG_UNPACK_DO_UTF8;
+    }
+
+    if (pat < patend && *pat == 'U') {
+	/*
+	  if (!(flags & FLAG_UNPACK_DO_UTF8)) 
+	      Perl_croak(aTHX_ "U0 mode on a byte string");
+	  flags &= ~FLAG_UNPACK_PARSE_UTF8;
+	*/
+    } else if (flags & FLAG_UNPACK_DO_UTF8) flags |= FLAG_UNPACK_PARSE_UTF8;
+
     sym.patptr = pat;
     sym.patend = patend;
     sym.flags  = flags;
 
-    return unpack_rec(&sym, s, s, strend, NULL );
+    return unpack_rec(&sym, s, s, strend, NULL);
+}
+
+STATIC unsigned char
+next_uni_byte(pTHX_ char **s, const char *end, I32 datumtype)
+{
+    UV val;
+    STRLEN retlen;
+    val =
+	UNI_TO_NATIVE(utf8n_to_uvuni(*s, end-*s, &retlen,
+				     ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY));
+    /* We try to process malformed UTF-8 as much as possible (preferrably with
+       warnings), but these two mean we make no progress in the string and
+       might enter an infinite loop */
+    if (retlen == (STRLEN) -1 || retlen == 0)
+	Perl_croak(aTHX_ "malformed UTF-8 string in unpack");
+    if (val >= 0x100) Perl_croak(aTHX_ "'%c' applied to character value %"UVf,
+				 (int) datumtype, val);
+    *s += retlen;
+    return val;
+}
+
+STATIC bool
+next_uni_uu(pTHX_ char **s, const char *end, I32 *out)
+{
+    UV val;
+    STRLEN retlen;
+    char *from = *s;
+    val = UNI_TO_NATIVE(utf8n_to_uvuni(*s, end-*s, &retlen, UTF8_CHECK_ONLY));
+    if (val >= 0x100 || !ISUUCHAR(val) || 
+	retlen == (STRLEN) -1 || retlen == 0) {
+	*out = 0;
+	return FALSE;
+    }
+    *out = PL_uudmap[val] & 077;
+    *s = from;
+    return TRUE;
+}
+
+STATIC bool
+next_uni_bytes(pTHX_ char **s, const char *end, char *buf, int buf_len)
+{
+    UV val;
+    STRLEN retlen;
+    char *from = *s;
+    while (buf_len > 0) {
+	if (from >= end) return 1;
+	val = UNI_TO_NATIVE(utf8n_to_uvuni(from, end-from, &retlen,
+					   UTF8_CHECK_ONLY));
+	if (val >= 0x100 || retlen == (STRLEN) -1 || retlen == 0) return FALSE;
+	from += retlen;
+	*(unsigned char *)buf++ = val;
+	buf_len--;
+    }
+    *s = from;
+    return TRUE;
 }
 
 STATIC
 I32
-S_unpack_rec(pTHX_ register tempsym_t* symptr, register char *s, char *strbeg, char *strend, char **new_s )
+S_unpack_rec(pTHX_ tempsym_t* symptr, char *s, char *strbeg, char *strend, char **new_s)
 {
     dSP;
+
     I32 datumtype;
-    register I32 len = 0;
-    register I32 bits = 0;
-    register char *str;
+    I32 len = 0;
     SV *sv;
     I32 start_sp_offset = SP - PL_stack_base;
     howlen_t howlen;
 
-    /* These must not be in registers: */
-    I16 ai16;
-    U16 au16;
     I32 ai32;
-    U32 au32;
-#ifdef HAS_QUAD
-    Quad_t aquad;
-    Uquad_t auquad;
-#endif
-#if SHORTSIZE != SIZE16
-    short ashort;
-    unsigned short aushort;
-#endif
-    int aint;
-    unsigned int auint;
     long along;
-#if LONGSIZE != SIZE32
-    unsigned long aulong;
-#endif
-    char *aptr;
-    float afloat;
-    double adouble;
-#if defined(HAS_LONG_DOUBLE) && defined(USE_LONG_DOUBLE)
-    long double aldouble;
-#endif
-    IV aiv;
-    UV auv;
-    NV anv;
 
     I32 checksum = 0;
-    UV cuv = 0;
-    NV cdouble = 0.0;
+    UV cuv;
+    NV cdouble;
     const int bits_in_uv = 8 * sizeof(cuv);
     char* strrelbeg = s;
-    bool beyond = FALSE;
+    bool beyond, literal;
     bool unpack_only_one = (symptr->flags & FLAG_UNPACK_ONLY_ONE) != 0;
+    bool utf8 = (symptr->flags & FLAG_UNPACK_PARSE_UTF8) ? 1 : 0;
 
     while (next_symbol(symptr)) {
         datumtype = symptr->code;
 	/* do first one only unless in list context
-	   / is implemented by unpacking the count, then poping it from the
+	   / is implemented by unpacking the count, then popping it from the
 	   stack, so must check that we're not in the middle of a /  */
-        if ( unpack_only_one
-	     && (SP - PL_stack_base == start_sp_offset + 1)
-	     && (datumtype != '/') )   /* XXX can this be omitted */
+        if (unpack_only_one &&
+	    (SP - PL_stack_base == start_sp_offset + 1) &&
+	    (datumtype != '/'))   /* XXX can this be omitted */
             break;
 
-        switch( howlen = symptr->howlen ){
+        switch (howlen = symptr->howlen) {
         case e_no_len:
 	case e_number:
 	    len = symptr->length;
@@ -755,13 +860,14 @@
 	    len = strend - strbeg;	/* long enough */          
 	    break;
         }
-
+	/* Means the length comes from the pattern, not through / */
+	literal = 1;	
       redo_switch:
         beyond = s >= strend;
 	switch(TYPE_NO_ENDIANNESS(datumtype)) {
 	default:
-	    Perl_croak(aTHX_ "Invalid type '%c' in unpack", (int)TYPE_NO_MODIFIERS(datumtype) );
-
+	    Perl_croak(aTHX_ "Invalid type '%c' in unpack",
+		       (int) TYPE_NO_MODIFIERS(datumtype));
 	case '%':
 	    if (howlen == e_no_len)
 		len = 16;		/* len is not specified */
@@ -770,9 +876,7 @@
 	    cdouble = 0;
 	    continue;
 	    break;
-	case '(':
-	{
-	    char *ss = s;		/* Move from register */
+	  case '(': {
             tempsym_t savsym = *symptr;
 	    U32 group_modifiers = TYPE_MODIFIERS(datumtype & ~symptr->flags);
 	    symptr->flags |= group_modifiers;
@@ -781,45 +885,92 @@
 	    PUTBACK;
 	    while (len--) {
   	        symptr->patptr = savsym.grpbeg;
- 	        unpack_rec(symptr, ss, strbeg, strend, &ss );
-                if (ss == strend && savsym.howlen == e_star)
+		  unpack_rec(symptr, s, strbeg, strend, &s);
+		  if (s == strend && savsym.howlen == e_star)
 		    break; /* No way to continue */
 	    }
 	    SPAGAIN;
-	    s = ss;
 	    symptr->flags &= ~group_modifiers;
             savsym.flags = symptr->flags;
             *symptr = savsym;
 	    break;
 	}
 	case '@':
+	    if (utf8) {
+		s = strrelbeg;
+		while (len > 0) {
+		    if (s >= strend)
+			Perl_croak(aTHX_ "'@' outside of string in unpack");
+		    s += UTF8SKIP(s);
+		    len--;
+		}
+		if (s > strend)
+		    Perl_croak(aTHX_ "'@' outside of string with malformed UTF-8 in unpack");
+	    } else {
 	    if (len > strend - strrelbeg)
 		Perl_croak(aTHX_ "'@' outside of string in unpack");
 	    s = strrelbeg + len;
+	    }
 	    break;
  	case 'X' | TYPE_IS_SHRIEKING:
  	    if (!len)			/* Avoid division by 0 */
  		len = 1;
- 	    len = (s - strbeg) % len;
+	    if (utf8) {
+		char *hop, *last;
+		I32 l;
+		for (l=len, hop = strbeg; hop < s; l++, hop += UTF8SKIP(hop))
+		    if (l == len) {
+			last = hop;
+			l = 0;
+		    }
+		s = last;
+		break;
+	    } else len = (s - strbeg) % len;
  	    /* FALL THROUGH */
 	case 'X':
+	    if (utf8) {
+		while (len > 0) {
+		    if (s <= strbeg)
+			Perl_croak(aTHX_ "'X' outside of string in unpack");
+		    while (UTF8_IS_CONTINUATION(*--s)) {
+			if (s <= strbeg)
+			    Perl_croak(aTHX_ "'X' outside of string in unpack");
+		    }
+		    len--;
+		}
+	    } else {
 	    if (len > s - strbeg)
-		Perl_croak(aTHX_ "'X' outside of string in unpack" );
+		    Perl_croak(aTHX_ "'X' outside of string in unpack");
 	    s -= len;
+	    }
 	    break;
  	case 'x' | TYPE_IS_SHRIEKING:
  	    if (!len)			/* Avoid division by 0 */
  		len = 1;
- 	    aint = (s - strbeg) % len;
- 	    if (aint)			/* Other portable ways? */
- 		len = len - aint;
- 	    else
- 		len = 0;
+	    if (utf8) {
+		char *hop = strbeg;
+		I32 l = 0;
+		for (hop = strbeg; hop < s; hop += UTF8SKIP(hop)) l++;
+		if (s != hop)
+		    Perl_croak(aTHX_ "malformed UTF-8 string in unpack");
+		ai32 = l % len;
+	    } else ai32 = (s - strbeg) % len;
+	    if (ai32 == 0) break;
+	    len = len - ai32;
  	    /* FALL THROUGH */
 	case 'x':
+	    if (utf8) {
+		while (len>0) {
+		    if (s >= strend)
+			Perl_croak(aTHX_ "'x' outside of string in unpack");
+		    s += UTF8SKIP(s);
+		    len--;
+		}
+	    } else {
 	    if (len > strend - s)
 		Perl_croak(aTHX_ "'x' outside of string in unpack");
 	    s += len;
+	    }
 	    break;
 	case '/':
 	    Perl_croak(aTHX_ "'/' must follow a numeric type in unpack");
@@ -827,224 +978,287 @@
 	case 'A':
 	case 'Z':
 	case 'a':
-	    if (len > strend - s)
-		len = strend - s;
-	    if (checksum)
-		goto uchar_checksum;
-	    sv = newSVpvn(s, len);
-	    if (len > 0 && (datumtype == 'A' || datumtype == 'Z')) {
-		aptr = s;	/* borrow register */
-		if (datumtype == 'Z') {	/* 'Z' strips stuff after first null */
-		    s = SvPVX(sv);
-		    while (*s)
-			s++;
-		    if (howlen == e_star) /* exact for 'Z*' */
-		        len = s - SvPVX(sv) + 1;
+	    if (checksum) goto uchar_checksum; /* Won't handle >= 256 right */
+	    if (utf8) {
+		I32 l;
+		char *hop;
+		for (l=len, hop=s; l>0; l--, hop += UTF8SKIP(hop)) {
+		    if (hop >= strend) {
+			if (hop > strend)
+			    Perl_croak(aTHX_ "malformed UTF-8 string in unpack");
+			break;
 		}
-		else {		/* 'A' strips both nulls and spaces */
-		    s = SvPVX(sv) + len - 1;
-		    while (s >= SvPVX(sv) && (!*s || isSPACE(*s)))
-			s--;
-		    *++s = '\0';
 		}
-		SvCUR_set(sv, s - SvPVX(sv));
-		s = aptr;	/* unborrow register */
+		if (hop > strend)
+		    Perl_croak(aTHX_ "malformed UTF-8 string in unpack");
+		len = hop - s;
+	    } else if (len > strend - s)
+		len = strend - s;
+	    if (datumtype == 'Z') {
+		/* 'Z' strips stuff after first null */
+		char *ptr;
+		for (ptr = s; ptr < strend; ptr++) if (*ptr == 0) break;
+		sv = newSVpvn(s, ptr-s);
+		if (howlen == e_star) /* exact for 'Z*' */
+		    len = ptr-s + (ptr != strend ? 1 : 0);
+	    } else if (datumtype == 'A') {
+		/* 'A' strips both nulls and spaces */
+		char *ptr;
+		for (ptr = s+len-1; ptr >= s; ptr--)
+		    if (*ptr != 0 && !isSPACE(*ptr)) break;
+		ptr++;
+		sv = newSVpvn(s, ptr-s);
+	    } else sv = newSVpvn(s, len);
+	    if (utf8) {
+		SvUTF8_on(sv);
+		/* Undo any upgrade done due to need_utf8() */
+		if (!(symptr->flags & FLAG_UNPACK_WAS_UTF8))
+		    sv_utf8_downgrade(sv, 0);
 	    }
-	    s += len;
 	    XPUSHs(sv_2mortal(sv));
+	    s += len;
 	    break;
 	case 'B':
-	case 'b':
+	  case 'b': {
+	      char *str;
+	      /* Preliminary length estimate, acceptable for utf8 too */
 	    if (howlen == e_star || len > (strend - s) * 8)
 		len = (strend - s) * 8;
 	    if (checksum) {
 		if (!PL_bitcount) {
-		    Newz(601, PL_bitcount, 256, char);
-		    for (bits = 1; bits < 256; bits++) {
-			if (bits & 1)	PL_bitcount[bits]++;
-			if (bits & 2)	PL_bitcount[bits]++;
-			if (bits & 4)	PL_bitcount[bits]++;
-			if (bits & 8)	PL_bitcount[bits]++;
-			if (bits & 16)	PL_bitcount[bits]++;
-			if (bits & 32)	PL_bitcount[bits]++;
-			if (bits & 64)	PL_bitcount[bits]++;
-			if (bits & 128)	PL_bitcount[bits]++;
+		      int bits;
+		      Newz(601, PL_bitcount, 0x100, char);
+		      for (bits = 1; bits < 0x100; bits++) {
+			  int b = bits;
+			  while (b) {
+			      if (b & 1) PL_bitcount[bits]++;
+			      b >>= 1;
+			  }
+		      }
 		    }
+		  if (utf8) {
+		      while (len >= 8 && s < strend) {
+			  cuv += PL_bitcount[next_uni_byte(aTHX_ &s, strend, datumtype)];
+			  len -= 8;
 		}
+		  } else {
 		while (len >= 8) {
 		    cuv += PL_bitcount[*(unsigned char*)s++];
 		    len -= 8;
 		}
-		if (len) {
-		    bits = *s;
+		  }
+		  if (len && s < strend) {
+		      unsigned char bits;
+		      bits = utf8 ? next_uni_byte(aTHX_ &s, strend, datumtype) : *s++;
 		    if (datumtype == 'b') {
 			while (len-- > 0) {
 			    if (bits & 1) cuv++;
 			    bits >>= 1;
 			}
-		    }
-		    else {
+		      } else {
 			while (len-- > 0) {
-			    if (bits & 128) cuv++;
+			      if (bits & 0x80) cuv++;
 			    bits <<= 1;
 			}
 		    }
 		}
 		break;
 	    }
-	    sv = NEWSV(35, len + 1);
-	    SvCUR_set(sv, len);
+	      sv = sv_2mortal(NEWSV(35, len ? len : 1));
 	    SvPOK_on(sv);
 	    str = SvPVX(sv);
 	    if (datumtype == 'b') {
-		aint = len;
-		for (len = 0; len < aint; len++) {
-		    if (len & 7)		/*SUPPRESS 595*/
-			bits >>= 1;
-		    else
-			bits = *s++;
-		    *str++ = '0' + (bits & 1);
-		}
+		  unsigned char bits;
+		  ai32 = len;
+		  for (len = 0; len < ai32; len++) {
+		      if (len & 7) bits >>= 1;
+		      else if (utf8) {
+			  if (s >= strend) break;
+			  bits = next_uni_byte(aTHX_ &s, strend, datumtype);
+		      } else bits = *(unsigned char *) s++;
+		      *str++ = bits & 1 ? '1' : '0';
 	    }
-	    else {
-		aint = len;
-		for (len = 0; len < aint; len++) {
-		    if (len & 7)
-			bits <<= 1;
-		    else
-			bits = *s++;
-		    *str++ = '0' + ((bits & 128) != 0);
+	      } else {
+		  unsigned char bits;
+		  ai32 = len;
+		  for (len = 0; len < ai32; len++) {
+		      if (len & 7) bits <<= 1;
+		      else if (utf8) {
+			  if (s >= strend) break;
+			  bits = next_uni_byte(aTHX_ &s, strend, datumtype);
+		      } else bits = *(unsigned char *) s++;
+		      *str++ = bits & 0x80 ? '1' : '0';
 		}
 	    }
 	    *str = '\0';
-	    XPUSHs(sv_2mortal(sv));
+	      SvCUR_set(sv, str - SvPVX(sv));
+	      XPUSHs(sv);
 	    break;
+	  }
 	case 'H':
-	case 'h':
+	  case 'h': {
+	      char *str;
+	      /* Preliminary length estimate, acceptable for utf8 too */
 	    if (howlen == e_star || len > (strend - s) * 2)
 		len = (strend - s) * 2;
-	    sv = NEWSV(35, len + 1);
-	    SvCUR_set(sv, len);
+	      sv = sv_2mortal(NEWSV(35, len ? len : 1));
 	    SvPOK_on(sv);
 	    str = SvPVX(sv);
 	    if (datumtype == 'h') {
-		aint = len;
-		for (len = 0; len < aint; len++) {
-		    if (len & 1)
-			bits >>= 4;
-		    else
-			bits = *s++;
+		  unsigned char bits;
+		  ai32 = len;
+		  for (len = 0; len < ai32; len++) {
+		      if (len & 1) bits >>= 4;
+		      else if (utf8) {
+			  if (s >= strend) break;
+			  bits = next_uni_byte(aTHX_ &s, strend, datumtype);
+		      } else bits = * (unsigned char *) s++;
 		    *str++ = PL_hexdigit[bits & 15];
 		}
-	    }
-	    else {
-		aint = len;
-		for (len = 0; len < aint; len++) {
-		    if (len & 1)
-			bits <<= 4;
-		    else
-			bits = *s++;
+	      } else {
+		  unsigned char bits;
+		  ai32 = len;
+		  for (len = 0; len < ai32; len++) {
+		      if (len & 1) bits <<= 4;
+		      else if (utf8) {
+			  if (s >= strend) break;
+			  bits = next_uni_byte(aTHX_ &s, strend, datumtype);
+		      } else bits = *(unsigned char *) s++;
 		    *str++ = PL_hexdigit[(bits >> 4) & 15];
 		}
 	    }
 	    *str = '\0';
-	    XPUSHs(sv_2mortal(sv));
+	      SvCUR_set(sv, str - SvPVX(sv));
+	      XPUSHs(sv);
 	    break;
+	  }
 	case 'c':
-	    if (len > strend - s)
-		len = strend - s;
+	    /* Preliminary length estimate, acceptable for utf8 too */
+	    if (len > strend - s) len = strend - s;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
-	    while (len-- > 0) {
-		aint = *s++;
-		if (aint >= 128)	/* fake up signed chars */
-		    aint -= 256;
-		if (!checksum) {
-		    PUSHs(sv_2mortal(newSViv((IV)aint)));
+	    if (utf8) {
+		while (len-- > 0 && s < strend) {
+		    ai32 = next_uni_byte(aTHX_ &s, strend, 'c');
+		    if (ai32 >= 128)	/* fake up signed chars */
+			ai32 -= 256;
+		    if (!checksum)
+			PUSHs(sv_2mortal(newSViv((IV)ai32)));
+		    else if (checksum > bits_in_uv)
+			cdouble += (NV)ai32;
+		    else
+			cuv += ai32;
 		}
+	    } else {
+		while (len-- > 0) {
+		    ai32 = *s++;
+		    if (ai32 >= 128)	/* fake up signed chars */
+			ai32 -= 256;
+		    if (!checksum)
+			PUSHs(sv_2mortal(newSViv((IV)ai32)));
 		else if (checksum > bits_in_uv)
-		    cdouble += (NV)aint;
+			cdouble += (NV) ai32;
 		else
-		    cuv += aint;
+			cuv += ai32;
+		}
 	    }
 	    break;
 	case 'C':
-	unpack_C: /* unpack U will jump here if not UTF-8 */
             if (len == 0) {
-                symptr->flags &= ~FLAG_UNPACK_DO_UTF8;
-		break;
+		if (literal) {
+		    /* Switch to "natural" mode */
+		    if (symptr->flags & FLAG_UNPACK_DO_UTF8) {
+			symptr->flags |= FLAG_UNPACK_PARSE_UTF8;
+			utf8 = 1;
+		    } else {
+			symptr->flags &= ~FLAG_UNPACK_PARSE_UTF8;
+			utf8 = 0;
 	    }
-	    if (len > strend - s)
-		len = strend - s;
-	    if (checksum) {
-	      uchar_checksum:
-		while (len-- > 0) {
-		    auint = *s++ & 255;
-		    cuv += auint;
 		}
+		break;
 	    }
-	    else {
-                if (len && unpack_only_one)
-                    len = 1;
+	  uchar_checksum:
+	    /* Preliminary length estimate, acceptable for utf8 too */
+	    if (len > strend - s) len = strend - s;
+	    if (!checksum) {
+		if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
+	    }
+	    if (utf8) {
+		while (len-- > 0 && s < strend) {
+		    /* Change this to unicode char fetch once all unpack("C")
+		       bugs have been flushed out */
+		    unsigned char ch = next_uni_byte(aTHX_ &s, strend, 'C');
+		    if (!checksum)
+			PUSHs(sv_2mortal(newSVuv((UV) ch)));
+		    else if (checksum > bits_in_uv)
+			cdouble += (NV) ch;
+		    else 
+			cuv += ch;
+		}
+	    } else if (!checksum) {
 		while (len-- > 0) {
-		    auint = *s++ & 255;
-		    PUSHs(sv_2mortal(newSViv((IV)auint)));
+		    unsigned char ch = *(unsigned char *) s++;
+		    PUSHs(sv_2mortal(newSVuv((UV) ch)));
 		}
+	    } else if (checksum > bits_in_uv) {
+		while (len-- > 0) cdouble += (NV) *(unsigned char *) s++;
+	    } else {
+		while (len-- > 0) cuv += *(unsigned char *) s++;
 	    }
 	    break;
 	case 'U':
 	    if (len == 0) {
-                symptr->flags |= FLAG_UNPACK_DO_UTF8;
+		if (literal) {
+		    /* Switch to "bytes in utf-8" mode */
+		    if (symptr->flags & FLAG_UNPACK_DO_UTF8) {
+			utf8 = 0;
+			symptr->flags &= ~FLAG_UNPACK_PARSE_UTF8;
+		    } else {
+			/* Should be impossible due to the need_utf8() test */
+			Perl_croak(aTHX_ "U0 mode on a byte string");
+		    }
+		}
 		break;
 	    }
-	    if ((symptr->flags & FLAG_UNPACK_DO_UTF8) == 0)
-		 goto unpack_C;
-	    if (len > strend - s)
-		len = strend - s;
+	    if (len > strend - s) len = strend - s;
 	    if (!checksum) {
-		if (len && unpack_only_one)
-                    len = 1;
+		if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0 && s < strend) {
-		STRLEN alen;
-		auint = NATIVE_TO_UNI(utf8n_to_uvchr((U8*)s, strend - s, &alen, ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANYUV));
-		along = alen;
-		s += along;
-		if (!checksum) {
-		    PUSHs(sv_2mortal(newSVuv((UV)auint)));
-		}
+		STRLEN retlen;
+		UV auv = utf8n_to_uvuni((U8*)s, strend - s, &retlen, ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANYUV);
+		if (retlen == (STRLEN) -1 || retlen == 0)
+		    Perl_croak(aTHX_ "malformed UTF-8 string in unpack");
+		s += retlen;
+		if (!checksum)
+		    PUSHs(sv_2mortal(newSVuv((UV) auv)));
 		else if (checksum > bits_in_uv)
-		    cdouble += (NV)auint;
+		    cdouble += (NV) auv;
 		else
-		    cuv += auint;
+		    cuv += auv;
 	    }
 	    break;
 	case 's' | TYPE_IS_SHRIEKING:
 #if SHORTSIZE != SIZE16
 	    along = (strend - s) / sizeof(short);
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
-		COPYNN(s, &ashort, sizeof(short));
-		DO_BO_UNPACK(ashort, s);
-		s += sizeof(short);
-		if (!checksum) {
+		short ashort;
+		COPYVAR(s, strend, utf8, ashort, s);
+		if (!checksum)
 		    PUSHs(sv_2mortal(newSViv((IV)ashort)));
-		}
 		else if (checksum > bits_in_uv)
 		    cdouble += (NV)ashort;
 		else
@@ -1056,25 +1270,31 @@
 #endif
 	case 's':
 	    along = (strend - s) / SIZE16;
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
+		I16 ai16;
+
+#if U16SIZE > SIZE16
+		ai16 = 0;
+#endif
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend, 
+					OFF16(&ai16), SIZE16)) break;
+		} else {
 		COPY16(s, &ai16);
+		    s += SIZE16;
+		}
 		DO_BO_UNPACK(ai16, 16);
 #if U16SIZE > SIZE16
-		if (ai16 > 32767)
-		    ai16 -= 65536;
+		if (ai16 > 32767) ai16 -= 65536;
 #endif
-		s += SIZE16;
-		if (!checksum) {
+		if (!checksum)
 		    PUSHs(sv_2mortal(newSViv((IV)ai16)));
-		}
 		else if (checksum > bits_in_uv)
 		    cdouble += (NV)ai16;
 		else
@@ -1084,21 +1304,17 @@
 	case 'S' | TYPE_IS_SHRIEKING:
 #if SHORTSIZE != SIZE16
 	    along = (strend - s) / sizeof(unsigned short);
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-		if (len && unpack_only_one)
-                    len = 1;
+		if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
-		COPYNN(s, &aushort, sizeof(unsigned short));
-		DO_BO_UNPACK(aushort, s);
-		s += sizeof(unsigned short);
-		if (!checksum) {
-		    PUSHs(sv_2mortal(newSViv((UV)aushort)));
-		}
+		unsigned short aushort;
+		COPYVAR(s, strend, utf8, aushort, s);
+		if (!checksum)
+		    PUSHs(sv_2mortal(newSVuv((UV) aushort)));
 		else if (checksum > bits_in_uv)
 		    cdouble += (NV)aushort;
 		else
@@ -1112,31 +1328,41 @@
 	case 'n':
 	case 'S':
 	    along = (strend - s) / SIZE16;
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
+		U16 au16;
+#if U16SIZE > SIZE16
+		au16 = 0;
+#endif
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend, 
+					OFF16(&au16), SIZE16)) break;
+		} else {
 		COPY16(s, &au16);
-		DO_BO_UNPACK(au16, 16);
 		s += SIZE16;
-#ifdef HAS_NTOHS
+		}
+		DO_BO_UNPACK(au16, 16);
 		if (datumtype == 'n')
+#ifdef HAS_NTOHS
 		    au16 = PerlSock_ntohs(au16);
+#else
+		Perl_croak(aTHX_ "'n' not supported on this platform");
 #endif
-#ifdef HAS_VTOHS
 		if (datumtype == 'v')
+#ifdef HAS_VTOHS
 		    au16 = vtohs(au16);
+#else
+		Perl_croak(aTHX_ "'v' not supported on this platform");
 #endif
-		if (!checksum) {
-		    PUSHs(sv_2mortal(newSViv((UV)au16)));
-		}
+		if (!checksum)
+		    PUSHs(sv_2mortal(newSVuv((UV)au16)));
 		else if (checksum > bits_in_uv)
-		    cdouble += (NV)au16;
+		    cdouble += (NV) au16;
 		else
 		    cuv += au16;
 	    }
@@ -1144,30 +1370,40 @@
 	case 'v' | TYPE_IS_SHRIEKING:
 	case 'n' | TYPE_IS_SHRIEKING:
 	    along = (strend - s) / SIZE16;
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
+		I16 ai16;
+#if U16SIZE > SIZE16
+		ai16 = 0;
+#endif
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend,
+					(char *) &ai16, sizeof(ai16))) break;
+		} else {
 		COPY16(s, &ai16);
 		s += SIZE16;
-#ifdef HAS_NTOHS
+		}
 		if (datumtype == ('n' | TYPE_IS_SHRIEKING))
-		    ai16 = (I16)PerlSock_ntohs((U16)ai16);
+#ifdef HAS_NTOHS
+		    ai16 = (I16) PerlSock_ntohs((U16) ai16);
+#else
+                    Perl_croak(aTHX_ "'n!' not supported on this platform");
 #endif
-#ifdef HAS_VTOHS
 		if (datumtype == ('v' | TYPE_IS_SHRIEKING))
-		    ai16 = (I16)vtohs((U16)ai16);
+#ifdef HAS_VTOHS
+		    ai16 = (I16) vtohs((U16) ai16);
+#else
+                    Perl_croak(aTHX_ "'v!' not supported on this platform");
 #endif
-		if (!checksum) {
+		if (!checksum)
 		    PUSHs(sv_2mortal(newSViv((IV)ai16)));
-		}
 		else if (checksum > bits_in_uv)
-		    cdouble += (NV)ai16;
+		    cdouble += (NV) ai16;
 		else
 		    cuv += ai16;
 	    }
@@ -1175,18 +1411,15 @@
 	case 'i':
 	case 'i' | TYPE_IS_SHRIEKING:
 	    along = (strend - s) / sizeof(int);
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
-		Copy(s, &aint, 1, int);
-		DO_BO_UNPACK(aint, i);
-		s += sizeof(int);
+		int aint;
+		COPYVAR(s, strend, utf8, aint, i);
 		if (!checksum) {
 		    PUSHs(sv_2mortal(newSViv((IV)aint)));
 		}
@@ -1199,18 +1432,15 @@
 	case 'I':
 	case 'I' | TYPE_IS_SHRIEKING:
 	    along = (strend - s) / sizeof(unsigned int);
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
-		Copy(s, &auint, 1, unsigned int);
-		DO_BO_UNPACK(auint, i);
-		s += sizeof(unsigned int);
+		unsigned int auint;
+		COPYVAR(s, strend, utf8, auint, i);
 		if (!checksum) {
 		    PUSHs(sv_2mortal(newSVuv((UV)auint)));
 		}
@@ -1222,27 +1452,25 @@
 	    break;
 	case 'j':
 	    along = (strend - s) / IVSIZE;
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
-		Copy(s, &aiv, 1, IV);
+		IV aiv;
 #if IVSIZE == INTSIZE
-		DO_BO_UNPACK(aiv, i);
+		COPYVAR(s, strend, utf8, aiv, i);
 #elif IVSIZE == LONGSIZE
-		DO_BO_UNPACK(aiv, l);
+		COPYVAR(s, strend, utf8, aiv, l);
 #elif defined(HAS_QUAD) && IVSIZE == U64SIZE
-		DO_BO_UNPACK(aiv, 64);
+		COPYVAR(s, strend, utf8, aiv, 64);
+#else
+		Perl_croak(aTHX_ "'j' not supported on this platform");
 #endif
-		s += IVSIZE;
-		if (!checksum) {
+		if (!checksum)
 		    PUSHs(sv_2mortal(newSViv(aiv)));
-		}
 		else if (checksum > bits_in_uv)
 		    cdouble += (NV)aiv;
 		else
@@ -1260,18 +1488,18 @@
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
-		Copy(s, &auv, 1, UV);
-#if UVSIZE == INTSIZE
-		DO_BO_UNPACK(auv, i);
-#elif UVSIZE == LONGSIZE
-		DO_BO_UNPACK(auv, l);
-#elif defined(HAS_QUAD) && UVSIZE == U64SIZE
-		DO_BO_UNPACK(auv, 64);
+		UV auv;
+#if IVSIZE == INTSIZE
+		COPYVAR(s, strend, utf8, auv, i);
+#elif IVSIZE == LONGSIZE
+		COPYVAR(s, strend, utf8, auv, l);
+#elif defined(HAS_QUAD) && IVSIZE == U64SIZE
+		COPYVAR(s, strend, utf8, auv, 64);
+#else
+		Perl_croak(aTHX_ "'J' not supported on this platform");
 #endif
-		s += UVSIZE;
-		if (!checksum) {
+		if (!checksum)
 		    PUSHs(sv_2mortal(newSVuv(auv)));
-		}
 		else if (checksum > bits_in_uv)
 		    cdouble += (NV)auv;
 		else
@@ -1281,21 +1509,17 @@
 	case 'l' | TYPE_IS_SHRIEKING:
 #if LONGSIZE != SIZE32
 	    along = (strend - s) / sizeof(long);
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
-		COPYNN(s, &along, sizeof(long));
-		DO_BO_UNPACK(along, l);
-		s += sizeof(long);
-		if (!checksum) {
+		long along;
+		COPYVAR(s, strend, utf8, along, l);
+		if (!checksum)
 		    PUSHs(sv_2mortal(newSViv((IV)along)));
-		}
 		else if (checksum > bits_in_uv)
 		    cdouble += (NV)along;
 		else
@@ -1307,25 +1531,30 @@
 #endif
 	case 'l':
 	    along = (strend - s) / SIZE32;
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
+		I32 ai32;
+#if U32SIZE > SIZE32
+		ai32 = 0;
+#endif
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend, 
+					OFF32(&ai32), SIZE32)) break;
+		} else {
 		COPY32(s, &ai32);
+		    s += SIZE32;
+		}
 		DO_BO_UNPACK(ai32, 32);
 #if U32SIZE > SIZE32
-		if (ai32 > 2147483647)
-		    ai32 -= 4294967296;
+		if (ai32 > 2147483647) ai32 -= 4294967296;
 #endif
-		s += SIZE32;
-		if (!checksum) {
+		if (!checksum)
 		    PUSHs(sv_2mortal(newSViv((IV)ai32)));
-		}
 		else if (checksum > bits_in_uv)
 		    cdouble += (NV)ai32;
 		else
@@ -1344,12 +1573,10 @@
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
-		COPYNN(s, &aulong, sizeof(unsigned long));
-		DO_BO_UNPACK(aulong, l);
-		s += sizeof(unsigned long);
-		if (!checksum) {
+		unsigned long aulog;
+		COPYVAR(s, strend, utf8, aulong, l);
+		if (!checksum)
 		    PUSHs(sv_2mortal(newSVuv((UV)aulong)));
-		}
 		else if (checksum > bits_in_uv)
 		    cdouble += (NV)aulong;
 		else
@@ -1363,29 +1590,39 @@
 	case 'N':
 	case 'L':
 	    along = (strend - s) / SIZE32;
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
+		U32 au32;
+#if U32SIZE > SIZE32
+		au32 = 0;
+#endif
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend, 
+					OFF32(&au32), SIZE32)) break;
+		} else {
 		COPY32(s, &au32);
-		DO_BO_UNPACK(au32, 32);
 		s += SIZE32;
-#ifdef HAS_NTOHL
+		}
+		DO_BO_UNPACK(au32, 32);
 		if (datumtype == 'N')
+#ifdef HAS_NTOHL
 		    au32 = PerlSock_ntohl(au32);
+#else
+		    Perl_croak(aTHX_ "'N' not supported on this platform");
 #endif
-#ifdef HAS_VTOHL
 		if (datumtype == 'V')
+#ifdef HAS_VTOHL
 		    au32 = vtohl(au32);
+#else
+		    Perl_croak(aTHX_ "'V' not supported on this platform");
 #endif
-		 if (!checksum) {
+		if (!checksum)
 		     PUSHs(sv_2mortal(newSVuv((UV)au32)));
-		 }
 		 else if (checksum > bits_in_uv)
 		     cdouble += (NV)au32;
 		 else
@@ -1395,28 +1632,38 @@
 	case 'V' | TYPE_IS_SHRIEKING:
 	case 'N' | TYPE_IS_SHRIEKING:
 	    along = (strend - s) / SIZE32;
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
+		I32 ai32;
+#if U32SIZE > SIZE32
+		ai32 = 0;
+#endif
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend, 
+					OFF32(&ai32), SIZE32)) break;
+		} else {
 		COPY32(s, &ai32);
 		s += SIZE32;
-#ifdef HAS_NTOHL
+		}
 		if (datumtype == ('N' | TYPE_IS_SHRIEKING))
+#ifdef HAS_NTOHL
 		    ai32 = (I32)PerlSock_ntohl((U32)ai32);
+#else
+		    Perl_croak(aTHX_ "'N!' not supported on this platform");
 #endif
-#ifdef HAS_VTOHL
 		if (datumtype == ('V' | TYPE_IS_SHRIEKING))
+#ifdef HAS_VTOHL
 		    ai32 = (I32)vtohl((U32)ai32);
+#else
+		    Perl_croak(aTHX_ "'V!' not supported on this platform");
 #endif
-		if (!checksum) {
+		if (!checksum)
 		    PUSHs(sv_2mortal(newSViv((IV)ai32)));
-		}
 		else if (checksum > bits_in_uv)
 		    cdouble += (NV)ai32;
 		else
@@ -1425,48 +1672,54 @@
 	    break;
 	case 'p':
 	    along = (strend - s) / sizeof(char*);
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    EXTEND(SP, len);
 	    EXTEND_MORTAL(len);
 	    while (len-- > 0) {
-		if (sizeof(char*) > strend - s)
-		    break;
-		else {
+		char *aptr;
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend,
+					(char *) &aptr, sizeof(aptr))) break;
+		} else {
 		    Copy(s, &aptr, 1, char*);
-		    DO_BO_UNPACK_P(aptr);
-		    s += sizeof(char*);
+		    s += sizeof(aptr);
 		}
+		DO_BO_UNPACK_P(aptr);
 		/* newSVpv generates undef if aptr is NULL */
 		PUSHs(sv_2mortal(newSVpv(aptr, 0)));
 	    }
 	    break;
 	case 'w':
-            if (len && unpack_only_one)
-                len = 1;
+            if (len && unpack_only_one) len = 1;
 	    EXTEND(SP, len);
 	    EXTEND_MORTAL(len);
 	    {
 		UV auv = 0;
 		U32 bytes = 0;
 		
-		while ((len > 0) && (s < strend)) {
-		    auv = (auv << 7) | (*s & 0x7f);
+		while (len > 0 && s < strend) {
+		    unsigned char ch;
+		    ch = utf8 ? next_uni_byte(aTHX_ &s, strend, 'w') : 
+			*(unsigned char *)s++;
+		    auv = (auv << 7) | (ch & 0x7f);
 		    /* UTF8_IS_XXXXX not right here - using constant 0x80 */
-		    if ((U8)(*s++) < 0x80) {
+		    if (ch < 0x80) {
 			bytes = 0;
 			PUSHs(sv_2mortal(newSVuv(auv)));
 			len--;
 			auv = 0;
+			continue;
 		    }
-		    else if (++bytes >= sizeof(UV)) {	/* promote to string */
+		    if (++bytes >= sizeof(UV)) {	/* promote to string */
 			char *t;
 			STRLEN n_a;
 
 			sv = Perl_newSVpvf(aTHX_ "%.*"UVf, (int)TYPE_DIGITS(UV), auv);
 			while (s < strend) {
-			    sv = mul128(sv, (U8)(*s & 0x7f));
-			    if (!(*s++ & 0x80)) {
+			    ch = utf8 ? next_uni_byte(aTHX_ &s, strend, 'w') :
+				*(unsigned char *)s++;
+			    sv = mul128(sv, (U8)(ch & 0x7f));
+			    if (!(ch & 0x80)) {
 				bytes = 0;
 				break;
 			    }
@@ -1488,36 +1741,36 @@
 	    if (symptr->howlen == e_star)
 	        Perl_croak(aTHX_ "'P' must have an explicit size in unpack");
 	    EXTEND(SP, 1);
-	    if (sizeof(char*) > strend - s)
-		break;
-	    else {
+	    if (sizeof(char*) > strend - s) break;
+	    {
+		char *aptr;
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend, (char *) &aptr, 
+					sizeof(aptr))) break;
+		} else {
 		Copy(s, &aptr, 1, char*);
-		DO_BO_UNPACK_P(aptr);
-		s += sizeof(char*);
+		    s += sizeof(aptr);
 	    }
+		DO_BO_UNPACK_P(aptr);
 	    /* newSVpvn generates undef if aptr is NULL */
 	    PUSHs(sv_2mortal(newSVpvn(aptr, len)));
+	    }
 	    break;
 #ifdef HAS_QUAD
 	case 'q':
 	    along = (strend - s) / sizeof(Quad_t);
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
                 EXTEND(SP, len);
                 EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
-		assert (s + sizeof(Quad_t) <= strend);
-		Copy(s, &aquad, 1, Quad_t);
-		DO_BO_UNPACK(aquad, 64);
-		s += sizeof(Quad_t);
-		if (!checksum) {
-                    PUSHs(sv_2mortal((aquad >= IV_MIN && aquad <= IV_MAX) ?
+		Quad_t aquad;
+		COPYVAR(s, strend, utf8, aquad, 64);
+		if (!checksum)
+                    PUSHs(sv_2mortal(aquad >= IV_MIN && aquad <= IV_MAX ?
 				     newSViv((IV)aquad) : newSVnv((NV)aquad)));
-                }
 		else if (checksum > bits_in_uv)
 		    cdouble += (NV)aquad;
 		else
@@ -1526,23 +1779,18 @@
 	    break;
 	case 'Q':
 	    along = (strend - s) / sizeof(Uquad_t);
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
                 EXTEND(SP, len);
                 EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
-		assert (s + sizeof(Uquad_t) <= strend);
-		Copy(s, &auquad, 1, Uquad_t);
-		DO_BO_UNPACK(auquad, 64);
-		s += sizeof(Uquad_t);
-		if (!checksum) {
-		    PUSHs(sv_2mortal((auquad <= UV_MAX) ?
-				     newSVuv((UV)auquad) : newSVnv((NV)auquad)));
-		}
+		Uquad_t auquad;
+		COPYVAR(s, strend, utf8, auquad, 64);
+		if (!checksum)
+		    PUSHs(sv_2mortal(auquad <= UV_MAX ?
+				     newSVuv((UV)auquad):newSVnv((NV)auquad)));
 		else if (checksum > bits_in_uv)
 		    cdouble += (NV)auquad;
 		else
@@ -1553,90 +1801,99 @@
 	/* float and double added gnb@melba.bby.oz.au 22/11/89 */
 	case 'f':
 	    along = (strend - s) / sizeof(float);
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
+		float afloat;
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend, (char *) &afloat, 
+					sizeof(afloat))) break;
+		} else {
 		Copy(s, &afloat, 1, float);
-		DO_BO_UNPACK_N(afloat, float);
 		s += sizeof(float);
-		if (!checksum) {
-		    PUSHs(sv_2mortal(newSVnv((NV)afloat)));
 		}
-		else {
+		DO_BO_UNPACK_N(afloat, float);
+		if (!checksum)
+		    PUSHs(sv_2mortal(newSVnv((NV)afloat)));
+		else
 		    cdouble += afloat;
 		}
-	    }
 	    break;
 	case 'd':
 	    along = (strend - s) / sizeof(double);
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
+		double adouble;
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend, (char *) &adouble, 
+					sizeof(adouble))) break;
+		} else {
 		Copy(s, &adouble, 1, double);
-		DO_BO_UNPACK_N(adouble, double);
 		s += sizeof(double);
-		if (!checksum) {
-		    PUSHs(sv_2mortal(newSVnv((NV)adouble)));
 		}
-		else {
+		DO_BO_UNPACK_N(adouble, double);
+		if (!checksum)
+		    PUSHs(sv_2mortal(newSVnv((NV)adouble)));
+		else
 		    cdouble += adouble;
 		}
-	    }
 	    break;
 	case 'F':
 	    along = (strend - s) / NVSIZE;
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
+		NV anv;
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend,
+					(char *) &anv, sizeof(anv))) break;
+		} else {
 		Copy(s, &anv, 1, NV);
-		DO_BO_UNPACK_N(anv, NV);
 		s += NVSIZE;
-		if (!checksum) {
-		    PUSHs(sv_2mortal(newSVnv(anv)));
 		}
-		else {
+		DO_BO_UNPACK_N(anv, NV);
+		if (!checksum)
+		    PUSHs(sv_2mortal(newSVnv(anv)));
+		else
 		    cdouble += anv;
 		}
-	    }
 	    break;
 #if defined(HAS_LONG_DOUBLE) && defined(USE_LONG_DOUBLE)
 	case 'D':
 	    along = (strend - s) / LONG_DOUBLESIZE;
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
+		long double aldouble;
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend, (char *) &aldouble, 
+					sizeof(aldouble))) break;
+		} else {
 		Copy(s, &aldouble, 1, long double);
-		DO_BO_UNPACK_N(aldouble, long double);
 		s += LONG_DOUBLESIZE;
-		if (!checksum) {
-		    PUSHs(sv_2mortal(newSVnv((NV)aldouble)));
-		}
-		else {cdouble += aldouble;
 		}
+		DO_BO_UNPACK_N(aldouble, long double);
+		if (!checksum)
+		    PUSHs(sv_2mortal(newSVnv((NV)aldouble)));
+		else
+		    cdouble += aldouble;
 	    }
 	    break;
 #endif
@@ -1660,8 +1917,34 @@
 
 	    along = (strend - s) * 3 / 4;
 	    sv = NEWSV(42, along);
-	    if (along)
-		SvPOK_on(sv);
+	    if (along) SvPOK_on(sv);
+	    if (utf8) {
+		while (next_uni_uu(aTHX_ &s, strend, &len)) {
+		    I32 a, b, c, d;
+		    char hunk[4];
+
+		    hunk[3] = '\0';
+		    while (len > 0) {
+			next_uni_uu(aTHX_ &s, strend, &a);
+			next_uni_uu(aTHX_ &s, strend, &b);
+			next_uni_uu(aTHX_ &s, strend, &c);
+			next_uni_uu(aTHX_ &s, strend, &d);
+			hunk[0] = (char)((a << 2) | (b >> 4));
+			hunk[1] = (char)((b << 4) | (c >> 2));
+			hunk[2] = (char)((c << 6) | d);
+			sv_catpvn(sv, hunk, (len > 3) ? 3 : len);
+			len -= 3;
+		    }
+		    if (s < strend) {
+			if (*s == '\n') s++;
+			else {
+			    /* possible checksum byte */
+			    char *skip = s+UTF8SKIP(s);
+			    if (skip < strend && *skip == '\n') s = skip+1;
+			}
+		    }
+		}
+	    } else {
 	    while (s < strend && *s > ' ' && ISUUCHAR(*s)) {
 		I32 a, b, c, d;
 		char hunk[4];
@@ -1697,6 +1980,7 @@
 		    if (s + 1 < strend && s[1] == '\n')
 		        s += 2;
 	    }
+	    }
 	    XPUSHs(sv_2mortal(sv));
 	    break;
 	}
@@ -1704,20 +1988,18 @@
 	if (checksum) {
 	    if (strchr("fFdD", TYPE_NO_MODIFIERS(datumtype)) ||
 	      (checksum > bits_in_uv &&
-	       strchr("csSiIlLnNUvVqQjJ", TYPE_NO_MODIFIERS(datumtype))) ) {
-		NV trouble;
+	       strchr("cCsSiIlLnNUvVqQjJ", TYPE_NO_MODIFIERS(datumtype)))) {
+		NV trouble, anv;
 
-                adouble = (NV) (1 << (checksum & 15));
+                anv = (NV) (1 << (checksum & 15));
 		while (checksum >= 16) {
 		    checksum -= 16;
-		    adouble *= 65536.0;
+		    anv *= 65536.0;
 		}
-		while (cdouble < 0.0)
-		    cdouble += adouble;
-		cdouble = Perl_modf(cdouble / adouble, &trouble) * adouble;
+		while (cdouble < 0.0) cdouble += anv;
+		cdouble = Perl_modf(cdouble / anv, &trouble) * anv;
 		sv = newSVnv(cdouble);
-	    }
-	    else {
+	    } else {
 		if (checksum < bits_in_uv) {
 		    UV mask = ((UV)1 << checksum) - 1;
 		    cuv &= mask;
@@ -1728,31 +2010,30 @@
 	    checksum = 0;
 	}
     
-        if (symptr->flags & FLAG_SLASH){
+        if (symptr->flags & FLAG_SLASH) {
             if (SP - PL_stack_base - start_sp_offset <= 0)
                 Perl_croak(aTHX_ "'/' must follow a numeric type in unpack");
-            if( next_symbol(symptr) ){
-              if( symptr->howlen == e_number )
-		Perl_croak(aTHX_ "Count after length/code in unpack" );
-              if( beyond ){
+            if (next_symbol(symptr)) {
+		if (symptr->howlen == e_number)
+		    Perl_croak(aTHX_ "Count after length/code in unpack");
+		if (beyond) {
          	/* ...end of char buffer then no decent length available */
-		Perl_croak(aTHX_ "length/code after end of string in unpack" );
+		    Perl_croak(aTHX_ "length/code after end of string in unpack");
               } else {
          	/* take top of stack (hope it's numeric) */
                 len = POPi;
-                if( len < 0 )
-                    Perl_croak(aTHX_ "Negative '/' count in unpack" );
-              }
-            } else {
-		Perl_croak(aTHX_ "Code missing after '/' in unpack" );
+		    if (len < 0)
+			Perl_croak(aTHX_ "Negative '/' count in unpack");
             }
+            } else
+		Perl_croak(aTHX_ "Code missing after '/' in unpack");
             datumtype = symptr->code;
+	    literal = 0;
 	    goto redo_switch;
         }
     }
 
-    if (new_s)
-	*new_s = s;
+    if (new_s) *new_s = s;
     PUTBACK;
     return SP - PL_stack_base - start_sp_offset;
 }
@@ -1764,32 +2045,24 @@
     I32 gimme = GIMME_V;
     STRLEN llen;
     STRLEN rlen;
-    register char *pat = SvPV(left, llen);
-#ifdef PACKED_IS_OCTETS
-    /* Packed side is assumed to be octets - so force downgrade if it
-       has been UTF-8 encoded by accident
-     */
-    register char *s = SvPVbyte(right, rlen);
-#else
-    register char *s = SvPV(right, rlen);
-#endif
+    char *pat = SvPV(left,  llen);
+    char *s   = SvPV(right, rlen);
     char *strend = s + rlen;
-    register char *patend = pat + llen;
-    register I32 cnt;
+    char *patend = pat + llen;
+    I32 cnt;
 
     PUTBACK;
     cnt = unpackstring(pat, patend, s, strend,
-		     ((gimme == G_SCALAR) ? FLAG_UNPACK_ONLY_ONE : 0)
-		     | (DO_UTF8(right) ? FLAG_UNPACK_DO_UTF8 : 0));
+		       (gimme == G_SCALAR ? FLAG_UNPACK_ONLY_ONE : 0) |
+		       (DO_UTF8(right)    ? FLAG_UNPACK_DO_UTF8  : 0));
 
     SPAGAIN;
-    if ( !cnt && gimme == G_SCALAR )
-       PUSHs(&PL_sv_undef);
+    if (!cnt && gimme == G_SCALAR) PUSHs(&PL_sv_undef);
     RETURN;
 }
 
 STATIC void
-S_doencodes(pTHX_ register SV *sv, register char *s, register I32 len)
+S_doencodes(pTHX_ SV *sv, char *s, I32 len)
 {
     char hunk[5];
 
@@ -1904,14 +2177,14 @@
 
 
 void
-Perl_pack_cat(pTHX_ SV *cat, char *pat, register char *patend, register SV **beglist, SV **endlist, SV ***next_in_list, U32 flags)
+Perl_pack_cat(pTHX_ SV *cat, char *pat, char *patend, SV **beglist, SV **endlist, SV ***next_in_list, U32 flags)
 {
     tempsym_t sym = { 0 };
     sym.patptr = pat;
     sym.patend = patend;
     sym.flags  = FLAG_PACK;
 
-    (void)pack_rec( cat, &sym, beglist, endlist );
+    (void)pack_rec( cat, &sym, beglist, endlist);
 }
 
 
@@ -1924,9 +2197,10 @@
 
 
 void
-Perl_packlist(pTHX_ SV *cat, char *pat, register char *patend, register SV **beglist, SV **endlist )
+Perl_packlist(pTHX_ SV *cat, char *pat, char *patend, SV **beglist, SV **endlist)
 {
     tempsym_t sym = { 0 };
+
     sym.patptr = pat;
     sym.patend = patend;
     sym.flags  = FLAG_PACK;
@@ -1931,17 +2205,17 @@
     sym.patend = patend;
     sym.flags  = FLAG_PACK;
 
-    (void)pack_rec( cat, &sym, beglist, endlist );
+    (void) pack_rec(cat, &sym, beglist, endlist);
 }
 
 
 STATIC
 SV **
-S_pack_rec(pTHX_ SV *cat, register tempsym_t* symptr, register SV **beglist, SV **endlist )
+S_pack_rec(pTHX_ SV *cat, tempsym_t* symptr, SV **beglist, SV **endlist)
 {
-    register I32 items;
+    I32 items;
     STRLEN fromlen;
-    register I32 len = 0;
+    I32 len = 0;
     SV *fromstr;
     /*SUPPRESS 442*/
     static char null10[] = {0,0,0,0,0,0,0,0,0,0};
@@ -1982,10 +2256,10 @@
     tempsym_t lookahead;
 
     items = endlist - beglist;
-    found = next_symbol( symptr );
+    found = next_symbol(symptr);
 
 #ifndef PACKED_IS_OCTETS
-    if (symptr->level == 0 && found && symptr->code == 'U' ){
+    if (symptr->level == 0 && found && symptr->code == 'U') {
 	SvUTF8_on(cat);
     }
 #endif
@@ -1997,7 +2271,7 @@
         I32 datumtype = symptr->code;
         howlen_t howlen;
 
-        switch( howlen = symptr->howlen ){
+        switch (howlen = symptr->howlen) {
         case e_no_len:
 	case e_number:
 	    len = symptr->length;
@@ -2010,10 +2284,10 @@
         /* Look ahead for next symbol. Do we have code/code? */
         lookahead = *symptr;
         found = next_symbol(&lookahead);
-	if ( symptr->flags & FLAG_SLASH ) {
-	    if (found){
- 	        if ( 0 == strchr( "aAZ", lookahead.code ) ||
-                     e_star != lookahead.howlen )
+	if ( symptr->flags & FLAG_SLASH) {
+	    if (found) {
+ 	        if ( 0 == strchr( "aAZ", lookahead.code) ||
+                     e_star != lookahead.howlen)
  		    Perl_croak(aTHX_ "'/' must be followed by 'a*', 'A*' or 'Z*' in pack");
 	        lengthcode = sv_2mortal(newSViv(sv_len(items > 0
 						   ? *beglist : &PL_sv_no)
@@ -2045,7 +2319,7 @@
             symptr->level++;
 	    while (len--) {
   	        symptr->patptr = savsym.grpbeg;
-		beglist = pack_rec(cat, symptr, beglist, endlist );
+		beglist = pack_rec(cat, symptr, beglist, endlist);
 		if (savsym.howlen == e_star && beglist == endlist)
 		    break;		/* No way to continue */
 	    }
@@ -2121,7 +2395,7 @@
 	case 'B':
 	case 'b':
 	    {
-		register char *str;
+		char *str;
 		I32 saveitems;
 
 		fromstr = NEXTFROM;
@@ -2177,7 +2451,7 @@
 	case 'H':
 	case 'h':
 	    {
-		register char *str;
+		char *str;
 		I32 saveitems;
 
 		fromstr = NEXTFROM;
@@ -2249,7 +2523,7 @@
 		    if ((aint < -128 || aint > 127) &&
 			ckWARN(WARN_PACK))
 		        Perl_warner(aTHX_ packWARN(WARN_PACK),
-				    "Character in 'c' format wrapped in pack" );
+				    "Character in 'c' format wrapped in pack");
 		    achar = aint & 255;
 		    sv_catpvn(cat, &achar, sizeof(char));
 		    break;
@@ -2722,10 +2997,10 @@
 PP(pp_pack)
 {
     dSP; dMARK; dORIGMARK; dTARGET;
-    register SV *cat = TARG;
+    SV *cat = TARG;
     STRLEN fromlen;
-    register char *pat = SvPVx(*++MARK, fromlen);
-    register char *patend = pat + fromlen;
+    char *pat = SvPVx(*++MARK, fromlen);
+    char *patend = pat + fromlen;
 
     MARK++;
     sv_setpvn(cat, "", 0);
diff -rubB perl-clean/t/op/pack.t perl-dev/t/op/pack.t
--- perl-clean/t/op/pack.t	Fri Jan 14 21:23:23 2005
+++ perl-dev/t/op/pack.t	Sat Jan 22 01:21:58 2005
@@ -835,7 +834,7 @@
 isnt(v1.20.300.4000, sprintf "%vd", pack("C0U*",1,20,300,4000));
 
 my $rslt = $Is_EBCDIC ? "156 67" : "199 162";
-is(join(" ", unpack("C*", chr(0x1e2))), $rslt);
+is(join(" ", unpack("U0C*", chr(0x1e2))), $rslt);
 
 # does pack U create Unicode?
 is(ord(pack('U', 300)), 300);
@@ -853,8 +852,8 @@
 SKIP: {
     skip "Not for EBCDIC", 4 if $Is_EBCDIC;
 
-    # does unpack C unravel pack U?
-    is("@{[unpack('C*', pack('U*', 100, 200))]}", "100 195 136");
+    # does unpack U0C unravel pack U?
+    is("@{[unpack('U0C*', pack('U*', 100, 200))]}", "100 195 136");
 
     # does pack U0C create Unicode?
     is("@{[pack('U0C*', 100, 195, 136)]}", v100.v200);
@@ -864,8 +863,9 @@
 
     # does unpack U0U on byte data warn?
     {
+        my $bad = pack("U0C", 255);
         local $SIG{__WARN__} = sub { $@ = "@_" };
-        my @null = unpack('U0U', chr(255));
+        my @null = unpack('U0U', $bad);
         like($@, qr/^Malformed UTF-8 character /);
     }
 }
diff -rubB perl-clean/t/uni/case.pl perl-dev/t/uni/case.pl
--- perl-clean/t/uni/case.pl	Fri Jan 21 22:44:38 2005
+++ perl-dev/t/uni/case.pl	Fri Jan 21 22:46:17 2005
@@ -64,7 +64,7 @@
 
     for my $i (sort keys %$spec) {
 	my $w = unidump($spec->{$i});
-	my $u = unpack "U0U", $i;
+	my $u = unpack "C0U", $i;
 	my $h = sprintf "%04X", $u;
 	my $c = chr($u); $c .= chr(0x100); chop $c;
 	my $d = $func->($c);

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About