develooper Front page | perl.perl5.porters | Postings from January 2005

encoding neutral unpack

Thread Next
From:
perl5-porters
Date:
January 22, 2005 13:50
Subject:
encoding neutral unpack
Message ID:
csuho9$c1h$1@post.home.lunix
Since it seems my original patch didn't get through to the list, here is
a repost with a patch to the latest version (so this includes the patch
in the mail that DID get through to the list)

This patch that makes unpack completely independent from whether the string 
happens to be upgraded or not.

Things to note:
  - The trick of using unpack("C*", $string) to see "through" the encoding
    doesn't work anymore 
    (which is as it should be if unpack is encoding-neutral)
    You can run unpack under "use bytes" to get the old effect though.

  - The meaning of U0 and C0 were sort of swapped for unpack. Consider:
     with current perl:
      perl -wle 'print for unpack("U0U*", "\341\277\274")'
      8188
      perl -wle 'print for unpack("C0U*", "\341\277\274")'
      225
      191
      188

    However, if you look at CONSTRUCTING such strings:

      perl -wle 'use Devel::Peek; Dump(pack("U0U*", 8188))'
      SV = PV(0x8162464) at 0x81734b8
      REFCNT = 1
      FLAGS = (PADTMP,POK,pPOK,UTF8)
      PV = 0x816b760 "\341\277\274"\0 [UTF8 "\x{1ffc}"]
      CUR = 3
      LEN = 14
       => wrong string (right bytes internally but has utf8 on)

      perl -wle 'use Devel::Peek; Dump(pack("C0U*", 225, 191, 188))'
      SV = PV(0x8162464) at 0x81734d0
      REFCNT = 1
      FLAGS = (PADTMP,POK,pPOK)
      PV = 0x817b320 "\303\241\302\277\302\274"\0
      CUR = 6
      LEN = 18

       => completely wrong string

    However, if you switch the C0 and U0 around:
     perl -wle 'use Devel::Peek; Dump(pack("U0U*", 225, 191, 188))'
     SV = PV(0x8162464) at 0x81734d0
     REFCNT = 1
     FLAGS = (PADTMP,POK,pPOK,UTF8)
     PV = 0x817b320 "\303\241\302\277\302\274"\0 [UTF8 "\x{e1}\x{bf}\x{bc}"]
     CUR = 6
     LEN = 18
   
     THAT is the string "\341\277\274" (albeit encoded as UTF8, but I'm
     working towards an encoding neutral unpack)

     perl -wle 'use Devel::Peek; Dump(pack("C0U*", 8188))'
     SV = PV(0x8162464) at 0x81734b8
     REFCNT = 1
     FLAGS = (PADTMP,POK,pPOK)
     PV = 0x816b760 "\341\277\274"\0
     CUR = 3
     LEN = 14

     ALSO the string "\341\277\274"

     Since the pack behaviour looks right and is the one documented, I 
     conclude that the semantics for C0 and U0 in unpack are wrong (reversed).
     With the patch they behave like what I think is right:

     ./perl -Ilib -wle 'print for unpack("C0U*", "\341\277\274")'
     8188
     ./perl -Ilib -wle 'print for unpack("U0U*", "\341\277\274")'
     225
     191
     188

    Funny enough I only had to change one test for this 
    (perl-dev/t/uni/case.pl), which is not even a core pack test.

  - I dropped the "register" in several places. The utf8 helper functions
    take the address of s (the most important of these registers), so that one
    won't stay in a register anyways on naive compilers and the smart ones 
    don't need the hint.

    I also made all this little help variables more local, which I think
    reads easier and maybe allows the compiler a bit more leeway to choose
    good registers.

  - I made the checksum for the "C" format also distinguish on bits_in_uv

  - For the moment I made unpack "C" on a char >= 256 be an error, though
    later on I'd like to make it just basically do ord(). C is however
    the most likely format that users will notice the changed semantics, so 
    for a first version it might be good to keep it like this so users can
    flush out errors. (since A, a and Z checksumming is currently delegated
    to "C", that also implies that these won't checksum chars >= 256)

  - Fixed a minor bug in that checksumming over a partial byte in 'B' formats
    forgot to advance the pointer (needs to be applied to maint too)

  - U0 or starting with U on an octet string is completely yucky to implement
    as a special case in all formats, so I do these by upgrading a temporary 
    string. This is inefficient if you only unpack few things from a huge
    string (fortunately this case shouldn't be normal. pack with an initial
    U would return an already upgraded string, so this would only happen if
    the string later got somehow degraded). Still might be worth warning about
    in the docs.

  - If the first character in the unpackstring is U it now behaves like
    there is an U0 before (this is consistent with the documentation
    and behaviour of pack, and needed for reversibility)

  - Only literal C0 and U0 cause a mode switch, not ones implied by
    something like: unpack("C/U", "\x00")
    (I consider that a bugfix that should also be applied to maint)

  - C0 and U0 modes are scoped to (), so in format "s(nU0v)2S", the U0 mode
    only applies to the v, NOT to the S or the n (in the second round)
    (I also consider the old behaviour here a bug. It made multiround
    groups and C/() style groups too unpredictable)

  - a,A,Z now conserve unicodeness when extracting from an unicode packed 
    string (can't happen (currently) if the string was constructed with
    pack).
    e.g.:

    ./perl -Ilib -wle 'use Devel::Peek; Dump(unpack("a*", pack("U*", 8188)))'
    SV = PV(0x8162ef8) at 0x8162b00
    REFCNT = 1
    FLAGS = (TEMP,POK,pPOK,UTF8)
    PV = 0x8167bf0 "\341\277\274"\0 [UTF8 "\x{1ffc}"]
    CUR = 3
    LEN = 4

    It could be argued that I should try to do sv_utf8_downgrade(sv, 1)
    before returning, but I prefer the current code. It allows proper 
    fixed format parsing of unicode (and byte) strings.

  - "A" stripped both NULL and any whitespace. I change it to NULL and space
    (documented as such, makes it more reversible from pack and I didn't 
    want to have to deal with the extended unicode definition of what a 
    space is)

  - There's a few code formatting changes I was too lazy to filter out again.
    Sorry for the noise lines in the patch.

  - Relatively few tests and code in perl itself need to be updated.
    The ugliest one is in lib/CGI.Util,pm (what Util.pm is doing there
    is utterly broken anyways, I left it for compatibility. However, what
    it really should do is utf8::downgrade or use %u escapes I think)

Probably this has too many semantics changes to apply it to to maint 
(though I think that all places where it makes a difference are bugs really), 
but I think it would be proper for blead.

The patch is only half the work. There's also companion work that has to be
done for pack() on a, A and Z formats (if their argument is utf-8). But let's 
first see how acceptance of this one goes before I tackle that...
This is the most important and urgent part anyways.
Probably also needs more tests for the new semantics (pack.t in fact hardly
notices the changes)

diff -rubB perl-clean/ext/Encode/lib/Encode/MIME/Header.pm perl-dev/ext/Encode/lib/Encode/MIME/Header.pm
--- perl-clean/ext/Encode/lib/Encode/MIME/Header.pm	Thu May 20 15:51:06 2004
+++ perl-dev/ext/Encode/lib/Encode/MIME/Header.pm	Fri Jan 21 23:08:12 2005
@@ -166,7 +166,7 @@
     $chunk =~ s{
 		([^0-9A-Za-z])
 	       }{
-		   join("" => map {sprintf "=%02X", $_} unpack("C*", $1))
+		   join("" => map {sprintf "=%02X", $_} unpack("U0C*", $1))
 	       }egox;
     return decode_utf8(HEAD . 'Q?' . $chunk . TAIL);
 }
diff -rubB perl-clean/ext/Encode/t/encoding.t perl-dev/ext/Encode/t/encoding.t
--- perl-clean/ext/Encode/t/encoding.t	Thu Apr 18 16:12:01 2002
+++ perl-dev/ext/Encode/t/encoding.t	Fri Jan 21 22:56:33 2005
@@ -57,7 +57,7 @@
 print "ok 8\n";
 
 # the first octet of UTF-8 encoded 0x3af 
-print "not " unless unpack("C", chr(0xdf)) == 0xce;
+print "not " unless unpack("U0C", chr(0xdf)) == 0xce;
 print "ok 9\n";
 
 print "not " unless unpack("U", pack("U", 0xdf)) == 0xdf;
diff -rubB perl-clean/ext/Storable/t/utf8hash.t perl-dev/ext/Storable/t/utf8hash.t
--- perl-clean/ext/Storable/t/utf8hash.t	Fri Sep  5 22:02:53 2003
+++ perl-dev/ext/Storable/t/utf8hash.t	Fri Jan 21 23:21:00 2005
@@ -57,12 +57,12 @@
 foreach my $i (@ords){
     my $u = chr($i); utf8::upgrade($u);
     # warn sprintf "%d,%d", bytes::length($u), is_utf8($u);
-    my $b = pack("C*", unpack("C*", $u));
+    my $b = pack("C*", unpack("U0C*", $u));
     # warn sprintf "%d,%d" ,bytes::length($b), is_utf8($b);
 
     isnt($u,	                        $b, 
 	 "equivalence - with utf8flag");
-    is   (pack("C*", unpack("C*", $u)), pack("C*", unpack("C*", $b)),
+    is   ($u, pack("U0C*", unpack("C*", $b)),
 	  "equivalence - without utf8flag");
 
     $utf8hash{$u} = $utf8hash{$b} = $i;
diff -rubB perl-clean/ext/Unicode/Normalize/t/short.t perl-dev/ext/Unicode/Normalize/t/short.t
--- perl-clean/ext/Unicode/Normalize/t/short.t	Tue Jun  8 21:07:22 2004
+++ perl-dev/ext/Unicode/Normalize/t/short.t	Fri Jan 21 23:25:12 2005
@@ -35,7 +35,7 @@
 no warnings qw(utf8);
 
 # U+3042 is 3-byte length (in UTF-8/UTF-EBCDIC)
-our $a = pack 'U0C', unpack 'C', "\x{3042}";
+our $a = pack 'U0C', unpack 'U0C', "\x{3042}";
 
 print NFD($a) eq "\0"
    ? "ok" : "not ok", " 2\n";
diff -rubB perl-clean/lib/CGI/Util.pm perl-dev/lib/CGI/Util.pm
--- perl-clean/lib/CGI/Util.pm	Wed Jun  9 10:33:10 2004
+++ perl-dev/lib/CGI/Util.pm	Sat Jan 22 02:57:21 2005
@@ -197,10 +197,9 @@
 # URL-encode data
 sub escape {
   shift() if @_ > 1 and ( ref($_[0]) || (defined $_[1] && $_[0] eq $CGI::DefaultClass));
-  my $toencode = shift;
-  return undef unless defined($toencode);
-  # force bytes while preserving backward compatibility -- dankogai
-  $toencode = pack("C*", unpack("C*", $toencode));
+  return undef unless defined($_[0]);
+  # force bytes while preserving backward compatibility
+  my $toencode = do { use bytes; "" . shift };
     if ($EBCDIC) {
       $toencode=~s/([^a-zA-Z0-9_.-])/uc sprintf("%%%02x",$E2A[ord($1)])/eg;
     } else {
diff -rubB perl-clean/lib/Encode/MIME/Header.pm perl-dev/lib/Encode/MIME/Header.pm
--- perl-clean/lib/Encode/MIME/Header.pm	Thu May 20 15:51:06 2004
+++ perl-dev/lib/Encode/MIME/Header.pm	Fri Jan 21 23:08:12 2005
@@ -166,8 +166,8 @@
     $chunk =~ s{
 		([^0-9A-Za-z])
 	       }{
-		   join("" => map {sprintf "=%02X", $_} unpack("C*", $1))
+		   join("" => map {sprintf "=%02X", $_} unpack("U0C*", $1))
 	       }egox;
     return decode_utf8(HEAD . 'Q?' . $chunk . TAIL);
 }
diff -rubB perl-clean/pp_pack.c perl-dev/pp_pack.c
--- perl-clean/pp_pack.c	Fri Jan 14 13:32:50 2005
+++ perl-dev/pp_pack.c	Sat Jan 22 16:06:04 2005
@@ -57,8 +57,8 @@
 
 #if U16SIZE > SIZE16 || U32SIZE > SIZE32
 #  if BYTEORDER == 0x1234 || BYTEORDER == 0x12345678    /* little-endian */
-#    define OFF16(p)	(char*)(p)
-#    define OFF32(p)	(char*)(p)
+#    define OFF16(p)	((char*)(p))
+#    define OFF32(p)	((char*)(p))
 #  else
 #    if BYTEORDER == 0x4321 || BYTEORDER == 0x87654321  /* big-endian */
 #      define OFF16(p)	((char*)(p) + (sizeof(U16) - SIZE16))
@@ -67,25 +67,38 @@
        }}}} bad cray byte order
 #    endif
 #  endif
-#  define COPY16(s,p)  (*(p) = 0, Copy(s, OFF16(p), SIZE16, char))
-#  define COPY32(s,p)  (*(p) = 0, Copy(s, OFF32(p), SIZE32, char))
-#  define COPYNN(s,p,n) (*(p) = 0, Copy(s, (char *)(p), n, char))
-#  define CAT16(sv,p)  sv_catpvn(sv, OFF16(p), SIZE16)
-#  define CAT32(sv,p)  sv_catpvn(sv, OFF32(p), SIZE32)
-#else
-#  define COPY16(s,p)  Copy(s, p, SIZE16, char)
-#  define COPY32(s,p)  Copy(s, p, SIZE32, char)
-#  define COPYNN(s,p,n) Copy(s, (char *)(p), n, char)
-#  define CAT16(sv,p)  sv_catpvn(sv, (char*)(p), SIZE16)
-#  define CAT32(sv,p)  sv_catpvn(sv, (char*)(p), SIZE32)
+#else
+#  define OFF16(p)     ((char *) (p))
+#  define OFF32(p)     ((char *) (p))
 #endif
 
+#define COPY16(s,p)  Copy(s, OFF16(p), SIZE16, char)
+#define COPY32(s,p)  Copy(s, OFF32(p), SIZE32, char)
+#define CAT16(sv,p)  sv_catpvn(sv, OFF16(p), SIZE16)
+#define CAT32(sv,p)  sv_catpvn(sv, OFF32(p), SIZE32)
+
+/* Only to be used inside a loop (see the break) */
+#define COPYVAR(s,strend,utf8,var,format)		\
+STMT_START {						\
+    if (utf8) {						\
+        if (!next_uni_bytes(aTHX_ &s, strend,		\
+            (char *) &var, sizeof(var))) break;		\
+    } else {						\
+        Copy(s, (char *) &var, sizeof(var), char);	\
+        s += sizeof(var);				\
+    }							\
+    DO_BO_UNPACK(var, format);				\
+} STMT_END
+
+
 /* Avoid stack overflow due to pathological templates. 100 should be plenty. */
 #define MAX_SUB_TEMPLATE_LEVEL 100
 
 /* flags (note that type modifiers can also be used as flags!) */
+#define FLAG_UNPACK_WAS_UTF8    0x40	/* original had FLAG_UNPACK_DO_UTF8 */
+#define FLAG_UNPACK_PARSE_UTF8  0x20	/* Parse as utf8 */
 #define FLAG_UNPACK_ONLY_ONE  0x10
-#define FLAG_UNPACK_DO_UTF8   0x08
+#define FLAG_UNPACK_DO_UTF8     0x08	/* The underlying string is utf8 */
 #define FLAG_SLASH            0x04
 #define FLAG_COMMA            0x02
 #define FLAG_PACK             0x01
@@ -241,24 +254,23 @@
 
 /* Returns the sizeof() struct described by pat */
 STATIC I32
-S_measure_struct(pTHX_ register tempsym_t* symptr)
+S_measure_struct(pTHX_ tempsym_t* symptr)
 {
-    register I32 len = 0;
-    register I32 total = 0;
+    I32 len = 0;
+    I32 total = 0;
     int star;
 
-    register int size;
+    int size;
 
     while (next_symbol(symptr)) {
-
-        switch( symptr->howlen ){
+        switch (symptr->howlen) {
         case e_no_len:
 	case e_number:
 	    len = symptr->length;
 	    break;
         case e_star:
    	    Perl_croak(aTHX_ "Within []-length '*' not allowed in %s",
-                       symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+                       symptr->flags & FLAG_PACK ? "pack" : "unpack");
             break;
         }
 
@@ -267,7 +279,7 @@
 	default:
             Perl_croak(aTHX_ "Invalid type '%c' in %s",
                        (int)TYPE_NO_MODIFIERS(symptr->code),
-                       symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+                       symptr->flags & FLAG_PACK ? "pack" : "unpack");
 	case '@':
 	case '/':
 	case 'U':			/* XXXX Is it correct? */
@@ -275,12 +287,11 @@
 	case 'u':
 	    Perl_croak(aTHX_ "Within []-length '%c' not allowed in %s",
                        (int)symptr->code,
-                       symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+                       symptr->flags & FLAG_PACK ? "pack" : "unpack");
 	case '%':
 	    size = 0;
 	    break;
-	case '(':
-	{
+	  case '(': {
             tempsym_t savsym = *symptr;
   	    symptr->patptr = savsym.grpbeg;
             symptr->patend = savsym.grpend;
@@ -302,7 +313,7 @@
 	    size = -1;
 	    if (total < len)
 		Perl_croak(aTHX_ "'X' outside of string in %s",
-                          symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+			   symptr->flags & FLAG_PACK ? "pack" : "unpack");
 	    break;
  	case 'x' | TYPE_IS_SHRIEKING:
  	    if (!len)			/* Avoid division by 0 */
@@ -432,7 +443,7 @@
  * returns char pointer to char after match, or NULL
  */
 STATIC char *
-S_group_end(pTHX_ register char *patptr, register char *patend, char ender)
+S_group_end(pTHX_ char *patptr, char *patend, char ender)
 {
     while (patptr < patend) {
 	char c = *patptr++;
@@ -461,7 +472,7 @@
  * Advances char pointer to 1st non-digit char and returns number
  */ 
 STATIC char *
-S_get_num(pTHX_ register char *patptr, I32 *lenptr )
+S_get_num(pTHX_ char *patptr, I32 *lenptr)
 {
   I32 len = *patptr++ - '0';
   while (isDIGIT(*patptr)) {
@@ -477,10 +488,10 @@
  * locates next template code and count
  */
 STATIC bool
-S_next_symbol(pTHX_ register tempsym_t* symptr )
+S_next_symbol(pTHX_ tempsym_t* symptr)
 {
-  register char* patptr = symptr->patptr; 
-  register char* patend = symptr->patend; 
+  char* patptr = symptr->patptr;
+  char* patend = symptr->patend;
 
   symptr->flags &= ~FLAG_SLASH;
 
@@ -498,26 +509,26 @@
       I32 code = *patptr++ & 0xFF;
       U32 inherited_modifiers = 0;
 
-      if (code == ','){ /* grandfather in commas but with a warning */
-	if (((symptr->flags & FLAG_COMMA) == 0) && ckWARN(WARN_UNPACK)){
+      if (code == ',') { /* grandfather in commas but with a warning */
+	if (!(symptr->flags & FLAG_COMMA) && ckWARN(WARN_UNPACK)) {
           symptr->flags |= FLAG_COMMA;
 	  Perl_warner(aTHX_ packWARN(WARN_UNPACK),
 	 	      "Invalid type ',' in %s",
-                      symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+                      symptr->flags & FLAG_PACK ? "pack" : "unpack");
         }
 	continue;
       }
       
       /* for '(', skip to ')' */
       if (code == '(') {  
-        if( isDIGIT(*patptr) || *patptr == '*' || *patptr == '[' )
+        if (isDIGIT(*patptr) || *patptr == '*' || *patptr == '[')
           Perl_croak(aTHX_ "()-group starts with a count in %s",
-                     symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+                     symptr->flags & FLAG_PACK ? "pack" : "unpack");
         symptr->grpbeg = patptr;
-        patptr = 1 + ( symptr->grpend = group_end(patptr, patend, ')') );
-        if( symptr->level >= MAX_SUB_TEMPLATE_LEVEL )
+        patptr = 1 + (symptr->grpend = group_end(patptr, patend, ')'));
+        if (symptr->level >= MAX_SUB_TEMPLATE_LEVEL)
 	  Perl_croak(aTHX_ "Too deeply nested ()-groups in %s",
-                     symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+                     symptr->flags & FLAG_PACK ? "pack" : "unpack");
       }
 
       /* look for group modifiers to inherit */
@@ -547,28 +558,28 @@
             break;
         }
 
-        if (modifier == 0)
-          break;
+	  if (!modifier) break;
 
         if (!strchr(allowed, TYPE_NO_MODIFIERS(code)))
-          Perl_croak(aTHX_ "'%c' allowed only after types %s in %s", *patptr,
-                     allowed, symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+	      Perl_croak(aTHX_ "'%c' allowed only after types %s in %s", 
+			 *patptr, allowed, 
+			 symptr->flags & FLAG_PACK ? "pack" : "unpack");
 
         if (TYPE_ENDIANNESS(code | modifier) == TYPE_ENDIANNESS_MASK)
           Perl_croak(aTHX_ "Can't use both '<' and '>' after type '%c' in %s",
                      (int) TYPE_NO_MODIFIERS(code),
-                     symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+			 symptr->flags & FLAG_PACK ? "pack" : "unpack");
         else if (TYPE_ENDIANNESS(code | modifier | inherited_modifiers) ==
                  TYPE_ENDIANNESS_MASK)
           Perl_croak(aTHX_ "Can't use '%c' in a group with different byte-order in %s",
-                     *patptr, symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+			 *patptr, symptr->flags & FLAG_PACK ? "pack" : "unpack");
 
         if (ckWARN(WARN_UNPACK)) {
           if (code & modifier)
 	    Perl_warner(aTHX_ packWARN(WARN_UNPACK),
                         "Duplicate modifier '%c' after '%c' in %s",
                         *patptr, (int) TYPE_NO_MODIFIERS(code),
-                        symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+			      symptr->flags & FLAG_PACK ? "pack" : "unpack");
         }
 
         code |= modifier;
@@ -581,9 +592,8 @@
       /* look for count and/or / */ 
       if (patptr < patend) {
 	if (isDIGIT(*patptr)) {
- 	  patptr = get_num( patptr, &symptr->length );
+ 	  patptr = get_num(patptr, &symptr->length);
           symptr->howlen = e_number;
-
         } else if (*patptr == '*') {
           patptr++;
           symptr->howlen = e_star;
@@ -591,11 +600,11 @@
         } else if (*patptr == '[') {
           char* lenptr = ++patptr;            
           symptr->howlen = e_number;
-          patptr = group_end( patptr, patend, ']' ) + 1;
+          patptr = group_end( patptr, patend, ']') + 1;
           /* what kind of [] is it? */
           if (isDIGIT(*lenptr)) {
-            lenptr = get_num( lenptr, &symptr->length );
-            if( *lenptr != ']' )
+            lenptr = get_num( lenptr, &symptr->length);
+            if (*lenptr != ']')
               Perl_croak(aTHX_ "Malformed integer in [] in %s",
                          symptr->flags & FLAG_PACK ? "pack" : "unpack");
           } else {
@@ -627,7 +636,7 @@
               if (patptr < patend &&
                   (isDIGIT(*patptr) || *patptr == '*' || *patptr == '['))
                 Perl_croak(aTHX_ "'/' does not take a repeat count in %s",
-                           symptr->flags & FLAG_PACK ? "pack" : "unpack" );
+                           symptr->flags & FLAG_PACK ? "pack" : "unpack");
             }
             break;
 	  }
@@ -648,6 +657,44 @@
 }
 
 /*
+   There is no way to cleanly handle the case where we should process the 
+   string per octet in its upgraded form while it's really in downgraded form
+   (e.g. estimates like strend-s as an upper bound for the number of 
+   characters left wouldn't work). So if we foresee the need of this 
+   (pattern starts with U or contains U0), we want to work on the encoded 
+   version of the string. Users are advised to upgrade their pack string 
+   themselves if they need to do a lot of unpacks like this on it
+*/
+STATIC bool 
+need_utf8(const char *pat, const char *patend)
+{
+    bool first = TRUE;
+    while (pat < patend) {
+	if (pat[0] == '#') {
+	    pat++;
+	    pat = memchr(pat, '\n', patend-pat);
+	    if (!pat) return FALSE;
+	} else if (pat[0] == 'U') {
+	    if (first || pat[1] == '0') return TRUE;
+	} else first = FALSE;
+	pat++;
+    }
+    return FALSE;
+}
+
+STATIC char
+first_symbol(const char *pat, const char *patend) {
+    while (pat < patend) {
+	if (pat[0] != '#') return pat[0];
+	pat++;
+	pat = memchr(pat, '\n', patend-pat);
+	if (!pat) return 0;
+	pat++;
+    }
+    return 0;
+}
+
+/*
 =for apidoc unpack_str
 
 The engine implementing unpack() Perl function. Note: parameters strbeg, new_s
@@ -656,14 +703,34 @@
 =cut */
 
 I32
-Perl_unpack_str(pTHX_ char *pat, register char *patend, register char *s, char *strbeg, char *strend, char **new_s, I32 ocnt, U32 flags)
+Perl_unpack_str(pTHX_ char *pat, char *patend, char *s, char *strbeg, char *strend, char **new_s, I32 ocnt, U32 flags)
 {
     tempsym_t sym = { 0 };
+
+    if (flags & FLAG_UNPACK_DO_UTF8) flags |= FLAG_UNPACK_WAS_UTF8;
+    else if (need_utf8(pat, patend)) {
+	/* We probably should try to avoid this in case a scalar context call
+	   wouldn't get to the "U0" */
+	STRLEN len = strend - s;
+	s = bytes_to_utf8(s, &len);
+	SAVEFREEPV(s);
+	strend = s + len;
+	flags |= FLAG_UNPACK_DO_UTF8;
+    }
+
+    if (first_symbol(pat, patend) == 'U') {
+	/*
+	  if (!(flags & FLAG_UNPACK_DO_UTF8)) 
+	      Perl_croak(aTHX_ "U0 mode on a byte string");
+	  flags &= ~FLAG_UNPACK_PARSE_UTF8;
+	*/
+    } else if (flags & FLAG_UNPACK_DO_UTF8) flags |= FLAG_UNPACK_PARSE_UTF8;
+
     sym.patptr = pat;
     sym.patend = patend;
     sym.flags  = flags;
 
-    return unpack_rec(&sym, s, s, strend, NULL );
+    return unpack_rec(&sym, s, s, strend, NULL);
 }
 
 /*
@@ -676,77 +743,147 @@
 =cut */
 
 I32
-Perl_unpackstring(pTHX_ char *pat, register char *patend, register char *s, char *strend, U32 flags)
+Perl_unpackstring(pTHX_ char *pat, char *patend, char *s, char *strend, U32 flags)
 {
     tempsym_t sym = { 0 };
+
+    if (flags & FLAG_UNPACK_DO_UTF8) flags |= FLAG_UNPACK_WAS_UTF8;
+    else if (need_utf8(pat, patend)) {
+	/* We probably should try to avoid this in case a scalar context call
+	   wouldn't get to the "U0" */
+	STRLEN len = strend - s;
+	s = bytes_to_utf8(s, &len);
+	SAVEFREEPV(s);
+	strend = s + len;
+	flags |= FLAG_UNPACK_DO_UTF8;
+    }
+
+    if (first_symbol(pat, patend) == 'U') {
+	/*
+	  if (!(flags & FLAG_UNPACK_DO_UTF8)) 
+	      Perl_croak(aTHX_ "U0 mode on a byte string");
+	  flags &= ~FLAG_UNPACK_PARSE_UTF8;
+	*/
+    } else if (flags & FLAG_UNPACK_DO_UTF8) flags |= FLAG_UNPACK_PARSE_UTF8;
+
     sym.patptr = pat;
     sym.patend = patend;
     sym.flags  = flags;
 
-    return unpack_rec(&sym, s, s, strend, NULL );
+    return unpack_rec(&sym, s, s, strend, NULL);
+}
+
+STATIC unsigned char
+next_uni_byte(pTHX_ char **s, const char *end, I32 datumtype)
+{
+    UV val;
+    STRLEN retlen;
+    val =
+	UNI_TO_NATIVE(utf8n_to_uvuni(*s, end-*s, &retlen,
+				     ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY));
+    /* We try to process malformed UTF-8 as much as possible (preferrably with
+       warnings), but these two mean we make no progress in the string and
+       might enter an infinite loop */
+    if (retlen == (STRLEN) -1 || retlen == 0)
+	Perl_croak(aTHX_ "Malformed UTF-8 string in unpack");
+    if (val >= 0x100) Perl_croak(aTHX_ "'%c' applied to character value %"UVf,
+				 (int) datumtype, val);
+    *s += retlen;
+    return val;
+}
+
+STATIC bool
+next_uni_bytes(pTHX_ char **s, const char *end, char *buf, int buf_len)
+{
+    UV val;
+    STRLEN retlen;
+    char *from = *s;
+    int bad = 0;
+    U32 flags = ckWARN(WARN_UTF8) ? 
+	UTF8_CHECK_ONLY : (UTF8_CHECK_ONLY | UTF8_ALLOW_ANY);
+    for (;buf_len > 0; buf_len--) {
+	if (from >= end) return FALSE;
+	val = UNI_TO_NATIVE(utf8n_to_uvuni(from, end-from, &retlen, flags));
+	if (retlen == (STRLEN) -1 || retlen == 0) {
+	    from += UTF8SKIP(from);
+	    bad |= 1;
+	} else from += retlen;
+	if (val >= 0x100) {
+	    bad |= 2;
+	    val &= 0xff;
+	}
+	*(unsigned char *)buf++ = val;
+    }
+    /* We have enough characters for the buffer. Did we have problems ? */
+    if (bad) {
+	if (bad & 1) {
+	    /* Rewalk the string fragment while warning */
+	    char *ptr;
+	    flags = ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY;
+	    for (ptr = *s; ptr < from; ptr += UTF8SKIP(ptr))
+		utf8n_to_uvuni(ptr, end-ptr, &retlen, flags);
+	    if (from > end) from = end;
+	}
+	if ((bad & 2) && ckWARN(WARN_UNPACK))
+	    Perl_warner(aTHX_ packWARN(WARN_UNPACK),
+			"Character(s) wrapped in unpack");
+    }
+    *s = from;
+    return TRUE;
+}
+
+STATIC bool
+next_uni_uu(pTHX_ char **s, const char *end, I32 *out)
+{
+    UV val;
+    STRLEN retlen;
+    char *from = *s;
+    val = UNI_TO_NATIVE(utf8n_to_uvuni(*s, end-*s, &retlen, UTF8_CHECK_ONLY));
+    if (val >= 0x100 || !ISUUCHAR(val) || 
+	retlen == (STRLEN) -1 || retlen == 0) {
+	*out = 0;
+	return FALSE;
+    }
+    *out = PL_uudmap[val] & 077;
+    *s = from;
+    return TRUE;
 }
 
 STATIC
 I32
-S_unpack_rec(pTHX_ register tempsym_t* symptr, register char *s, char *strbeg, char *strend, char **new_s )
+S_unpack_rec(pTHX_ tempsym_t* symptr, char *s, char *strbeg, char *strend, char **new_s)
 {
     dSP;
+
     I32 datumtype;
-    register I32 len = 0;
-    register I32 bits = 0;
-    register char *str;
+    I32 len = 0;
     SV *sv;
     I32 start_sp_offset = SP - PL_stack_base;
     howlen_t howlen;
 
-    /* These must not be in registers: */
-    I16 ai16;
-    U16 au16;
     I32 ai32;
-    U32 au32;
-#ifdef HAS_QUAD
-    Quad_t aquad;
-    Uquad_t auquad;
-#endif
-#if SHORTSIZE != SIZE16
-    short ashort;
-    unsigned short aushort;
-#endif
-    int aint;
-    unsigned int auint;
     long along;
-#if LONGSIZE != SIZE32
-    unsigned long aulong;
-#endif
-    char *aptr;
-    float afloat;
-    double adouble;
-#if defined(HAS_LONG_DOUBLE) && defined(USE_LONG_DOUBLE)
-    long double aldouble;
-#endif
-    IV aiv;
-    UV auv;
-    NV anv;
 
     I32 checksum = 0;
-    UV cuv = 0;
-    NV cdouble = 0.0;
+    UV cuv;
+    NV cdouble;
     const int bits_in_uv = 8 * sizeof(cuv);
     char* strrelbeg = s;
-    bool beyond = FALSE;
+    bool beyond, literal;
     bool unpack_only_one = (symptr->flags & FLAG_UNPACK_ONLY_ONE) != 0;
+    bool utf8 = (symptr->flags & FLAG_UNPACK_PARSE_UTF8) ? 1 : 0;
 
     while (next_symbol(symptr)) {
         datumtype = symptr->code;
 	/* do first one only unless in list context
-	   / is implemented by unpacking the count, then poping it from the
+	   / is implemented by unpacking the count, then popping it from the
 	   stack, so must check that we're not in the middle of a /  */
-        if ( unpack_only_one
-	     && (SP - PL_stack_base == start_sp_offset + 1)
-	     && (datumtype != '/') )   /* XXX can this be omitted */
+        if (unpack_only_one &&
+	    (SP - PL_stack_base == start_sp_offset + 1) &&
+	    (datumtype != '/'))   /* XXX can this be omitted */
             break;
 
-        switch( howlen = symptr->howlen ){
+        switch (howlen = symptr->howlen) {
         case e_no_len:
 	case e_number:
 	    len = symptr->length;
@@ -755,13 +892,14 @@
 	    len = strend - strbeg;	/* long enough */          
 	    break;
         }
-
+	/* Means the length comes from the pattern, not through / */
+	literal = 1;	
       redo_switch:
         beyond = s >= strend;
 	switch(TYPE_NO_ENDIANNESS(datumtype)) {
 	default:
-	    Perl_croak(aTHX_ "Invalid type '%c' in unpack", (int)TYPE_NO_MODIFIERS(datumtype) );
-
+	    Perl_croak(aTHX_ "Invalid type '%c' in unpack",
+		       (int) TYPE_NO_MODIFIERS(datumtype));
 	case '%':
 	    if (howlen == e_no_len)
 		len = 16;		/* len is not specified */
@@ -770,9 +908,7 @@
 	    cdouble = 0;
 	    continue;
 	    break;
-	case '(':
-	{
-	    char *ss = s;		/* Move from register */
+	  case '(': {
             tempsym_t savsym = *symptr;
 	    U32 group_modifiers = TYPE_MODIFIERS(datumtype & ~symptr->flags);
 	    symptr->flags |= group_modifiers;
@@ -780,46 +916,95 @@
             symptr->level++;
 	    PUTBACK;
 	    while (len--) {
+		  if (utf8) symptr->flags |=  FLAG_UNPACK_PARSE_UTF8;
+		  else      symptr->flags &= ~FLAG_UNPACK_PARSE_UTF8;
   	        symptr->patptr = savsym.grpbeg;
- 	        unpack_rec(symptr, ss, strbeg, strend, &ss );
-                if (ss == strend && savsym.howlen == e_star)
+		  unpack_rec(symptr, s, strbeg, strend, &s);
+		  if (s == strend && savsym.howlen == e_star)
 		    break; /* No way to continue */
 	    }
 	    SPAGAIN;
-	    s = ss;
 	    symptr->flags &= ~group_modifiers;
             savsym.flags = symptr->flags;
             *symptr = savsym;
 	    break;
 	}
 	case '@':
+	    if (utf8) {
+		s = strrelbeg;
+		while (len > 0) {
+		    if (s >= strend)
+			Perl_croak(aTHX_ "'@' outside of string in unpack");
+		    s += UTF8SKIP(s);
+		    len--;
+		}
+		if (s > strend)
+		    Perl_croak(aTHX_ "'@' outside of string with malformed UTF-8 in unpack");
+	    } else {
 	    if (len > strend - strrelbeg)
 		Perl_croak(aTHX_ "'@' outside of string in unpack");
 	    s = strrelbeg + len;
+	    }
 	    break;
  	case 'X' | TYPE_IS_SHRIEKING:
  	    if (!len)			/* Avoid division by 0 */
  		len = 1;
- 	    len = (s - strbeg) % len;
+	    if (utf8) {
+		char *hop, *last;
+		I32 l;
+		for (l=len, hop = strbeg; hop < s; l++, hop += UTF8SKIP(hop))
+		    if (l == len) {
+			last = hop;
+			l = 0;
+		    }
+		s = last;
+		break;
+	    } else len = (s - strbeg) % len;
  	    /* FALL THROUGH */
 	case 'X':
+	    if (utf8) {
+		while (len > 0) {
+		    if (s <= strbeg)
+			Perl_croak(aTHX_ "'X' outside of string in unpack");
+		    while (UTF8_IS_CONTINUATION(*--s)) {
+			if (s <= strbeg)
+			    Perl_croak(aTHX_ "'X' outside of string in unpack");
+		    }
+		    len--;
+		}
+	    } else {
 	    if (len > s - strbeg)
-		Perl_croak(aTHX_ "'X' outside of string in unpack" );
+		    Perl_croak(aTHX_ "'X' outside of string in unpack");
 	    s -= len;
+	    }
 	    break;
  	case 'x' | TYPE_IS_SHRIEKING:
  	    if (!len)			/* Avoid division by 0 */
  		len = 1;
- 	    aint = (s - strbeg) % len;
- 	    if (aint)			/* Other portable ways? */
- 		len = len - aint;
- 	    else
- 		len = 0;
+	    if (utf8) {
+		char *hop = strbeg;
+		I32 l = 0;
+		for (hop = strbeg; hop < s; hop += UTF8SKIP(hop)) l++;
+		if (s != hop)
+		    Perl_croak(aTHX_ "Malformed UTF-8 string in unpack");
+		ai32 = l % len;
+	    } else ai32 = (s - strbeg) % len;
+	    if (ai32 == 0) break;
+	    len = len - ai32;
  	    /* FALL THROUGH */
 	case 'x':
+	    if (utf8) {
+		while (len>0) {
+		    if (s >= strend)
+			Perl_croak(aTHX_ "'x' outside of string in unpack");
+		    s += UTF8SKIP(s);
+		    len--;
+		}
+	    } else {
 	    if (len > strend - s)
 		Perl_croak(aTHX_ "'x' outside of string in unpack");
 	    s += len;
+	    }
 	    break;
 	case '/':
 	    Perl_croak(aTHX_ "'/' must follow a numeric type in unpack");
@@ -827,224 +1012,277 @@
 	case 'A':
 	case 'Z':
 	case 'a':
-	    if (len > strend - s)
-		len = strend - s;
-	    if (checksum)
-		goto uchar_checksum;
-	    sv = newSVpvn(s, len);
-	    if (len > 0 && (datumtype == 'A' || datumtype == 'Z')) {
-		aptr = s;	/* borrow register */
-		if (datumtype == 'Z') {	/* 'Z' strips stuff after first null */
-		    s = SvPVX(sv);
-		    while (*s)
-			s++;
-		    if (howlen == e_star) /* exact for 'Z*' */
-		        len = s - SvPVX(sv) + 1;
+	    if (checksum) goto uchar_checksum; /* Won't handle >= 256 right */
+	    if (utf8) {
+		I32 l;
+		char *hop;
+		for (l=len, hop=s; l>0; l--, hop += UTF8SKIP(hop)) {
+		    if (hop >= strend) {
+			if (hop > strend)
+			    Perl_croak(aTHX_ "Malformed UTF-8 string in unpack");
+			break;
 		}
-		else {		/* 'A' strips both nulls and spaces */
-		    s = SvPVX(sv) + len - 1;
-		    while (s >= SvPVX(sv) && (!*s || isSPACE(*s)))
-			s--;
-		    *++s = '\0';
 		}
-		SvCUR_set(sv, s - SvPVX(sv));
-		s = aptr;	/* unborrow register */
+		if (hop > strend)
+		    Perl_croak(aTHX_ "Malformed UTF-8 string in unpack");
+		len = hop - s;
+	    } else if (len > strend - s)
+		len = strend - s;
+	    if (datumtype == 'Z') {
+		/* 'Z' strips stuff after first null */
+		char *ptr;
+		for (ptr = s; ptr < strend; ptr++) if (*ptr == 0) break;
+		sv = newSVpvn(s, ptr-s);
+		if (howlen == e_star) /* exact for 'Z*' */
+		    len = ptr-s + (ptr != strend ? 1 : 0);
+	    } else if (datumtype == 'A') {
+		/* 'A' strips both nulls and spaces */
+		char *ptr;
+		for (ptr = s+len-1; ptr >= s; ptr--)
+		    if (*ptr != 0 && !isSPACE(*ptr)) break;
+		ptr++;
+		sv = newSVpvn(s, ptr-s);
+	    } else sv = newSVpvn(s, len);
+	    if (utf8) {
+		SvUTF8_on(sv);
+		/* Undo any upgrade done due to need_utf8() */
+		if (!(symptr->flags & FLAG_UNPACK_WAS_UTF8))
+		    sv_utf8_downgrade(sv, 0);
 	    }
-	    s += len;
 	    XPUSHs(sv_2mortal(sv));
+	    s += len;
 	    break;
 	case 'B':
-	case 'b':
+	  case 'b': {
+	      char *str;
+	      /* Preliminary length estimate, acceptable for utf8 too */
 	    if (howlen == e_star || len > (strend - s) * 8)
 		len = (strend - s) * 8;
 	    if (checksum) {
 		if (!PL_bitcount) {
-		    Newz(601, PL_bitcount, 256, char);
-		    for (bits = 1; bits < 256; bits++) {
-			if (bits & 1)	PL_bitcount[bits]++;
-			if (bits & 2)	PL_bitcount[bits]++;
-			if (bits & 4)	PL_bitcount[bits]++;
-			if (bits & 8)	PL_bitcount[bits]++;
-			if (bits & 16)	PL_bitcount[bits]++;
-			if (bits & 32)	PL_bitcount[bits]++;
-			if (bits & 64)	PL_bitcount[bits]++;
-			if (bits & 128)	PL_bitcount[bits]++;
+		      int bits;
+		      Newz(601, PL_bitcount, 0x100, char);
+		      for (bits = 1; bits < 0x100; bits++) {
+			  int b = bits;
+			  while (b) {
+			      if (b & 1) PL_bitcount[bits]++;
+			      b >>= 1;
+			  }
 		    }
 		}
+		  if (utf8) {
+		      while (len >= 8 && s < strend) {
+			  cuv += PL_bitcount[next_uni_byte(aTHX_ &s, strend, datumtype)];
+			  len -= 8;
+		      }
+		  } else {
 		while (len >= 8) {
 		    cuv += PL_bitcount[*(unsigned char*)s++];
 		    len -= 8;
 		}
-		if (len) {
-		    bits = *s;
+		  }
+		  if (len && s < strend) {
+		      unsigned char bits;
+		      bits = utf8 ? next_uni_byte(aTHX_ &s, strend, datumtype) : *s++;
 		    if (datumtype == 'b') {
 			while (len-- > 0) {
 			    if (bits & 1) cuv++;
 			    bits >>= 1;
 			}
-		    }
-		    else {
+		      } else {
 			while (len-- > 0) {
-			    if (bits & 128) cuv++;
+			      if (bits & 0x80) cuv++;
 			    bits <<= 1;
 			}
 		    }
 		}
 		break;
 	    }
-	    sv = NEWSV(35, len + 1);
-	    SvCUR_set(sv, len);
+	      sv = sv_2mortal(NEWSV(35, len ? len : 1));
 	    SvPOK_on(sv);
 	    str = SvPVX(sv);
 	    if (datumtype == 'b') {
-		aint = len;
-		for (len = 0; len < aint; len++) {
-		    if (len & 7)		/*SUPPRESS 595*/
-			bits >>= 1;
-		    else
-			bits = *s++;
-		    *str++ = '0' + (bits & 1);
-		}
+		  unsigned char bits;
+		  ai32 = len;
+		  for (len = 0; len < ai32; len++) {
+		      if (len & 7) bits >>= 1;
+		      else if (utf8) {
+			  if (s >= strend) break;
+			  bits = next_uni_byte(aTHX_ &s, strend, datumtype);
+		      } else bits = *(unsigned char *) s++;
+		      *str++ = bits & 1 ? '1' : '0';
 	    }
-	    else {
-		aint = len;
-		for (len = 0; len < aint; len++) {
-		    if (len & 7)
-			bits <<= 1;
-		    else
-			bits = *s++;
-		    *str++ = '0' + ((bits & 128) != 0);
+	      } else {
+		  unsigned char bits;
+		  ai32 = len;
+		  for (len = 0; len < ai32; len++) {
+		      if (len & 7) bits <<= 1;
+		      else if (utf8) {
+			  if (s >= strend) break;
+			  bits = next_uni_byte(aTHX_ &s, strend, datumtype);
+		      } else bits = *(unsigned char *) s++;
+		      *str++ = bits & 0x80 ? '1' : '0';
 		}
 	    }
 	    *str = '\0';
-	    XPUSHs(sv_2mortal(sv));
+	      SvCUR_set(sv, str - SvPVX(sv));
+	      XPUSHs(sv);
 	    break;
+	  }
 	case 'H':
-	case 'h':
+	  case 'h': {
+	      char *str;
+	      /* Preliminary length estimate, acceptable for utf8 too */
 	    if (howlen == e_star || len > (strend - s) * 2)
 		len = (strend - s) * 2;
-	    sv = NEWSV(35, len + 1);
-	    SvCUR_set(sv, len);
+	      sv = sv_2mortal(NEWSV(35, len ? len : 1));
 	    SvPOK_on(sv);
 	    str = SvPVX(sv);
 	    if (datumtype == 'h') {
-		aint = len;
-		for (len = 0; len < aint; len++) {
-		    if (len & 1)
-			bits >>= 4;
-		    else
-			bits = *s++;
+		  unsigned char bits;
+		  ai32 = len;
+		  for (len = 0; len < ai32; len++) {
+		      if (len & 1) bits >>= 4;
+		      else if (utf8) {
+			  if (s >= strend) break;
+			  bits = next_uni_byte(aTHX_ &s, strend, datumtype);
+		      } else bits = * (unsigned char *) s++;
 		    *str++ = PL_hexdigit[bits & 15];
 		}
-	    }
-	    else {
-		aint = len;
-		for (len = 0; len < aint; len++) {
-		    if (len & 1)
-			bits <<= 4;
-		    else
-			bits = *s++;
+	      } else {
+		  unsigned char bits;
+		  ai32 = len;
+		  for (len = 0; len < ai32; len++) {
+		      if (len & 1) bits <<= 4;
+		      else if (utf8) {
+			  if (s >= strend) break;
+			  bits = next_uni_byte(aTHX_ &s, strend, datumtype);
+		      } else bits = *(unsigned char *) s++;
 		    *str++ = PL_hexdigit[(bits >> 4) & 15];
 		}
 	    }
 	    *str = '\0';
-	    XPUSHs(sv_2mortal(sv));
+	      SvCUR_set(sv, str - SvPVX(sv));
+	      XPUSHs(sv);
 	    break;
+	  }
 	case 'c':
-	    if (len > strend - s)
-		len = strend - s;
+	    /* Preliminary length estimate, acceptable for utf8 too */
+	    if (len > strend - s) len = strend - s;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
-	    while (len-- > 0) {
-		aint = *s++;
-		if (aint >= 128)	/* fake up signed chars */
-		    aint -= 256;
-		if (!checksum) {
-		    PUSHs(sv_2mortal(newSViv((IV)aint)));
+	    if (utf8) {
+		while (len-- > 0 && s < strend) {
+		    ai32 = next_uni_byte(aTHX_ &s, strend, 'c');
+		    if (ai32 >= 128)	/* fake up signed chars */
+			ai32 -= 256;
+		    if (!checksum)
+			PUSHs(sv_2mortal(newSViv((IV)ai32)));
+		    else if (checksum > bits_in_uv)
+			cdouble += (NV)ai32;
+		    else
+			cuv += ai32;
 		}
+	    } else {
+		while (len-- > 0) {
+		    ai32 = *s++;
+		    if (ai32 >= 128)	/* fake up signed chars */
+			ai32 -= 256;
+		    if (!checksum)
+			PUSHs(sv_2mortal(newSViv((IV)ai32)));
 		else if (checksum > bits_in_uv)
-		    cdouble += (NV)aint;
+			cdouble += (NV) ai32;
 		else
-		    cuv += aint;
+			cuv += ai32;
+		}
 	    }
 	    break;
 	case 'C':
-	unpack_C: /* unpack U will jump here if not UTF-8 */
             if (len == 0) {
-                symptr->flags &= ~FLAG_UNPACK_DO_UTF8;
+		if (literal)
+		    /* Switch to "natural" mode */
+		    utf8 = (symptr->flags & FLAG_UNPACK_DO_UTF8) ? 1 : 0;
 		break;
 	    }
-	    if (len > strend - s)
-		len = strend - s;
-	    if (checksum) {
 	      uchar_checksum:
-		while (len-- > 0) {
-		    auint = *s++ & 255;
-		    cuv += auint;
-		}
-	    }
-	    else {
-                if (len && unpack_only_one)
-                    len = 1;
+	    /* Preliminary length estimate, acceptable for utf8 too */
+	    if (len > strend - s) len = strend - s;
+	    if (!checksum) {
+		if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
+	    }
+	    if (utf8) {
+		while (len-- > 0 && s < strend) {
+		    /* Change this to unicode char fetch once all unpack("C")
+		       bugs have been flushed out */
+		    unsigned char ch = next_uni_byte(aTHX_ &s, strend, 'C');
+		    if (!checksum)
+			PUSHs(sv_2mortal(newSVuv((UV) ch)));
+		    else if (checksum > bits_in_uv)
+			cdouble += (NV) ch;
+		    else 
+			cuv += ch;
+		}
+	    } else if (!checksum) {
 		while (len-- > 0) {
-		    auint = *s++ & 255;
-		    PUSHs(sv_2mortal(newSViv((IV)auint)));
+		    unsigned char ch = *(unsigned char *) s++;
+		    PUSHs(sv_2mortal(newSVuv((UV) ch)));
 		}
+	    } else if (checksum > bits_in_uv) {
+		while (len-- > 0) cdouble += (NV) *(unsigned char *) s++;
+	    } else {
+		while (len-- > 0) cuv += *(unsigned char *) s++;
 	    }
 	    break;
 	case 'U':
 	    if (len == 0) {
-                symptr->flags |= FLAG_UNPACK_DO_UTF8;
+		if (literal) {
+		    /* Switch to "bytes in utf-8" mode */
+		    if (symptr->flags & FLAG_UNPACK_DO_UTF8) utf8 = 0;
+		    else
+			/* Should be impossible due to the need_utf8() test */
+			Perl_croak(aTHX_ "U0 mode on a byte string");
+		}
 		break;
 	    }
-	    if ((symptr->flags & FLAG_UNPACK_DO_UTF8) == 0)
-		 goto unpack_C;
-	    if (len > strend - s)
-		len = strend - s;
+	    if (len > strend - s) len = strend - s;
 	    if (!checksum) {
-		if (len && unpack_only_one)
-                    len = 1;
+		if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0 && s < strend) {
-		STRLEN alen;
-		auint = NATIVE_TO_UNI(utf8n_to_uvchr((U8*)s, strend - s, &alen, ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANYUV));
-		along = alen;
-		s += along;
-		if (!checksum) {
-		    PUSHs(sv_2mortal(newSVuv((UV)auint)));
-		}
+		STRLEN retlen;
+		UV auv = utf8n_to_uvuni((U8*)s, strend - s, &retlen, ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANYUV);
+		if (retlen == (STRLEN) -1 || retlen == 0)
+		    Perl_croak(aTHX_ "Malformed UTF-8 string in unpack");
+		s += retlen;
+		if (!checksum)
+		    PUSHs(sv_2mortal(newSVuv((UV) auv)));
 		else if (checksum > bits_in_uv)
-		    cdouble += (NV)auint;
+		    cdouble += (NV) auv;
 		else
-		    cuv += auint;
+		    cuv += auv;
 	    }
 	    break;
 	case 's' | TYPE_IS_SHRIEKING:
 #if SHORTSIZE != SIZE16
 	    along = (strend - s) / sizeof(short);
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
-		COPYNN(s, &ashort, sizeof(short));
-		DO_BO_UNPACK(ashort, s);
-		s += sizeof(short);
-		if (!checksum) {
+		short ashort;
+		COPYVAR(s, strend, utf8, ashort, s);
+		if (!checksum)
 		    PUSHs(sv_2mortal(newSViv((IV)ashort)));
-		}
 		else if (checksum > bits_in_uv)
 		    cdouble += (NV)ashort;
 		else
@@ -1056,25 +1294,31 @@
 #endif
 	case 's':
 	    along = (strend - s) / SIZE16;
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
+		I16 ai16;
+
+#if U16SIZE > SIZE16
+		ai16 = 0;
+#endif
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend, 
+					OFF16(&ai16), SIZE16)) break;
+		} else {
 		COPY16(s, &ai16);
+		    s += SIZE16;
+		}
 		DO_BO_UNPACK(ai16, 16);
 #if U16SIZE > SIZE16
-		if (ai16 > 32767)
-		    ai16 -= 65536;
+		if (ai16 > 32767) ai16 -= 65536;
 #endif
-		s += SIZE16;
-		if (!checksum) {
+		if (!checksum)
 		    PUSHs(sv_2mortal(newSViv((IV)ai16)));
-		}
 		else if (checksum > bits_in_uv)
 		    cdouble += (NV)ai16;
 		else
@@ -1084,21 +1328,17 @@
 	case 'S' | TYPE_IS_SHRIEKING:
 #if SHORTSIZE != SIZE16
 	    along = (strend - s) / sizeof(unsigned short);
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-		if (len && unpack_only_one)
-                    len = 1;
+		if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
-		COPYNN(s, &aushort, sizeof(unsigned short));
-		DO_BO_UNPACK(aushort, s);
-		s += sizeof(unsigned short);
-		if (!checksum) {
-		    PUSHs(sv_2mortal(newSViv((UV)aushort)));
-		}
+		unsigned short aushort;
+		COPYVAR(s, strend, utf8, aushort, s);
+		if (!checksum)
+		    PUSHs(sv_2mortal(newSVuv((UV) aushort)));
 		else if (checksum > bits_in_uv)
 		    cdouble += (NV)aushort;
 		else
@@ -1112,31 +1352,41 @@
 	case 'n':
 	case 'S':
 	    along = (strend - s) / SIZE16;
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
+		U16 au16;
+#if U16SIZE > SIZE16
+		au16 = 0;
+#endif
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend, 
+					OFF16(&au16), SIZE16)) break;
+		} else {
 		COPY16(s, &au16);
-		DO_BO_UNPACK(au16, 16);
 		s += SIZE16;
-#ifdef HAS_NTOHS
+		}
+		DO_BO_UNPACK(au16, 16);
 		if (datumtype == 'n')
+#ifdef HAS_NTOHS
 		    au16 = PerlSock_ntohs(au16);
+#else
+		Perl_croak(aTHX_ "'n' not supported on this platform");
 #endif
-#ifdef HAS_VTOHS
 		if (datumtype == 'v')
+#ifdef HAS_VTOHS
 		    au16 = vtohs(au16);
+#else
+		Perl_croak(aTHX_ "'v' not supported on this platform");
 #endif
-		if (!checksum) {
-		    PUSHs(sv_2mortal(newSViv((UV)au16)));
-		}
+		if (!checksum)
+		    PUSHs(sv_2mortal(newSVuv((UV)au16)));
 		else if (checksum > bits_in_uv)
-		    cdouble += (NV)au16;
+		    cdouble += (NV) au16;
 		else
 		    cuv += au16;
 	    }
@@ -1144,30 +1394,40 @@
 	case 'v' | TYPE_IS_SHRIEKING:
 	case 'n' | TYPE_IS_SHRIEKING:
 	    along = (strend - s) / SIZE16;
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
+		I16 ai16;
+#if U16SIZE > SIZE16
+		ai16 = 0;
+#endif
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend,
+					(char *) &ai16, sizeof(ai16))) break;
+		} else {
 		COPY16(s, &ai16);
 		s += SIZE16;
-#ifdef HAS_NTOHS
+		}
 		if (datumtype == ('n' | TYPE_IS_SHRIEKING))
-		    ai16 = (I16)PerlSock_ntohs((U16)ai16);
+#ifdef HAS_NTOHS
+		    ai16 = (I16) PerlSock_ntohs((U16) ai16);
+#else
+                    Perl_croak(aTHX_ "'n!' not supported on this platform");
 #endif
-#ifdef HAS_VTOHS
 		if (datumtype == ('v' | TYPE_IS_SHRIEKING))
-		    ai16 = (I16)vtohs((U16)ai16);
+#ifdef HAS_VTOHS
+		    ai16 = (I16) vtohs((U16) ai16);
+#else
+                    Perl_croak(aTHX_ "'v!' not supported on this platform");
 #endif
-		if (!checksum) {
+		if (!checksum)
 		    PUSHs(sv_2mortal(newSViv((IV)ai16)));
-		}
 		else if (checksum > bits_in_uv)
-		    cdouble += (NV)ai16;
+		    cdouble += (NV) ai16;
 		else
 		    cuv += ai16;
 	    }
@@ -1175,18 +1435,15 @@
 	case 'i':
 	case 'i' | TYPE_IS_SHRIEKING:
 	    along = (strend - s) / sizeof(int);
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
-		Copy(s, &aint, 1, int);
-		DO_BO_UNPACK(aint, i);
-		s += sizeof(int);
+		int aint;
+		COPYVAR(s, strend, utf8, aint, i);
 		if (!checksum) {
 		    PUSHs(sv_2mortal(newSViv((IV)aint)));
 		}
@@ -1199,18 +1456,15 @@
 	case 'I':
 	case 'I' | TYPE_IS_SHRIEKING:
 	    along = (strend - s) / sizeof(unsigned int);
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
-		Copy(s, &auint, 1, unsigned int);
-		DO_BO_UNPACK(auint, i);
-		s += sizeof(unsigned int);
+		unsigned int auint;
+		COPYVAR(s, strend, utf8, auint, i);
 		if (!checksum) {
 		    PUSHs(sv_2mortal(newSVuv((UV)auint)));
 		}
@@ -1222,27 +1476,25 @@
 	    break;
 	case 'j':
 	    along = (strend - s) / IVSIZE;
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
-		Copy(s, &aiv, 1, IV);
+		IV aiv;
 #if IVSIZE == INTSIZE
-		DO_BO_UNPACK(aiv, i);
+		COPYVAR(s, strend, utf8, aiv, i);
 #elif IVSIZE == LONGSIZE
-		DO_BO_UNPACK(aiv, l);
+		COPYVAR(s, strend, utf8, aiv, l);
 #elif defined(HAS_QUAD) && IVSIZE == U64SIZE
-		DO_BO_UNPACK(aiv, 64);
+		COPYVAR(s, strend, utf8, aiv, 64);
+#else
+		Perl_croak(aTHX_ "'j' not supported on this platform");
 #endif
-		s += IVSIZE;
-		if (!checksum) {
+		if (!checksum)
 		    PUSHs(sv_2mortal(newSViv(aiv)));
-		}
 		else if (checksum > bits_in_uv)
 		    cdouble += (NV)aiv;
 		else
@@ -1260,18 +1512,18 @@
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
-		Copy(s, &auv, 1, UV);
-#if UVSIZE == INTSIZE
-		DO_BO_UNPACK(auv, i);
-#elif UVSIZE == LONGSIZE
-		DO_BO_UNPACK(auv, l);
-#elif defined(HAS_QUAD) && UVSIZE == U64SIZE
-		DO_BO_UNPACK(auv, 64);
+		UV auv;
+#if IVSIZE == INTSIZE
+		COPYVAR(s, strend, utf8, auv, i);
+#elif IVSIZE == LONGSIZE
+		COPYVAR(s, strend, utf8, auv, l);
+#elif defined(HAS_QUAD) && IVSIZE == U64SIZE
+		COPYVAR(s, strend, utf8, auv, 64);
+#else
+		Perl_croak(aTHX_ "'J' not supported on this platform");
 #endif
-		s += UVSIZE;
-		if (!checksum) {
+		if (!checksum)
 		    PUSHs(sv_2mortal(newSVuv(auv)));
-		}
 		else if (checksum > bits_in_uv)
 		    cdouble += (NV)auv;
 		else
@@ -1281,21 +1533,17 @@
 	case 'l' | TYPE_IS_SHRIEKING:
 #if LONGSIZE != SIZE32
 	    along = (strend - s) / sizeof(long);
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
-		COPYNN(s, &along, sizeof(long));
-		DO_BO_UNPACK(along, l);
-		s += sizeof(long);
-		if (!checksum) {
+		long along;
+		COPYVAR(s, strend, utf8, along, l);
+		if (!checksum)
 		    PUSHs(sv_2mortal(newSViv((IV)along)));
-		}
 		else if (checksum > bits_in_uv)
 		    cdouble += (NV)along;
 		else
@@ -1307,25 +1555,30 @@
 #endif
 	case 'l':
 	    along = (strend - s) / SIZE32;
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
+		I32 ai32;
+#if U32SIZE > SIZE32
+		ai32 = 0;
+#endif
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend, 
+					OFF32(&ai32), SIZE32)) break;
+		} else {
 		COPY32(s, &ai32);
+		    s += SIZE32;
+		}
 		DO_BO_UNPACK(ai32, 32);
 #if U32SIZE > SIZE32
-		if (ai32 > 2147483647)
-		    ai32 -= 4294967296;
+		if (ai32 > 2147483647) ai32 -= 4294967296;
 #endif
-		s += SIZE32;
-		if (!checksum) {
+		if (!checksum)
 		    PUSHs(sv_2mortal(newSViv((IV)ai32)));
-		}
 		else if (checksum > bits_in_uv)
 		    cdouble += (NV)ai32;
 		else
@@ -1344,12 +1597,10 @@
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
-		COPYNN(s, &aulong, sizeof(unsigned long));
-		DO_BO_UNPACK(aulong, l);
-		s += sizeof(unsigned long);
-		if (!checksum) {
+		unsigned long aulog;
+		COPYVAR(s, strend, utf8, aulong, l);
+		if (!checksum)
 		    PUSHs(sv_2mortal(newSVuv((UV)aulong)));
-		}
 		else if (checksum > bits_in_uv)
 		    cdouble += (NV)aulong;
 		else
@@ -1363,29 +1614,39 @@
 	case 'N':
 	case 'L':
 	    along = (strend - s) / SIZE32;
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
+		U32 au32;
+#if U32SIZE > SIZE32
+		au32 = 0;
+#endif
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend, 
+					OFF32(&au32), SIZE32)) break;
+		} else {
 		COPY32(s, &au32);
-		DO_BO_UNPACK(au32, 32);
 		s += SIZE32;
-#ifdef HAS_NTOHL
+		}
+		DO_BO_UNPACK(au32, 32);
 		if (datumtype == 'N')
+#ifdef HAS_NTOHL
 		    au32 = PerlSock_ntohl(au32);
+#else
+		    Perl_croak(aTHX_ "'N' not supported on this platform");
 #endif
-#ifdef HAS_VTOHL
 		if (datumtype == 'V')
+#ifdef HAS_VTOHL
 		    au32 = vtohl(au32);
+#else
+		    Perl_croak(aTHX_ "'V' not supported on this platform");
 #endif
-		 if (!checksum) {
+		if (!checksum)
 		     PUSHs(sv_2mortal(newSVuv((UV)au32)));
-		 }
 		 else if (checksum > bits_in_uv)
 		     cdouble += (NV)au32;
 		 else
@@ -1395,28 +1656,38 @@
 	case 'V' | TYPE_IS_SHRIEKING:
 	case 'N' | TYPE_IS_SHRIEKING:
 	    along = (strend - s) / SIZE32;
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
+		I32 ai32;
+#if U32SIZE > SIZE32
+		ai32 = 0;
+#endif
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend, 
+					OFF32(&ai32), SIZE32)) break;
+		} else {
 		COPY32(s, &ai32);
 		s += SIZE32;
-#ifdef HAS_NTOHL
+		}
 		if (datumtype == ('N' | TYPE_IS_SHRIEKING))
+#ifdef HAS_NTOHL
 		    ai32 = (I32)PerlSock_ntohl((U32)ai32);
+#else
+		    Perl_croak(aTHX_ "'N!' not supported on this platform");
 #endif
-#ifdef HAS_VTOHL
 		if (datumtype == ('V' | TYPE_IS_SHRIEKING))
+#ifdef HAS_VTOHL
 		    ai32 = (I32)vtohl((U32)ai32);
+#else
+		    Perl_croak(aTHX_ "'V!' not supported on this platform");
 #endif
-		if (!checksum) {
+		if (!checksum)
 		    PUSHs(sv_2mortal(newSViv((IV)ai32)));
-		}
 		else if (checksum > bits_in_uv)
 		    cdouble += (NV)ai32;
 		else
@@ -1425,48 +1696,54 @@
 	    break;
 	case 'p':
 	    along = (strend - s) / sizeof(char*);
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    EXTEND(SP, len);
 	    EXTEND_MORTAL(len);
 	    while (len-- > 0) {
-		if (sizeof(char*) > strend - s)
-		    break;
-		else {
+		char *aptr;
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend,
+					(char *) &aptr, sizeof(aptr))) break;
+		} else {
 		    Copy(s, &aptr, 1, char*);
-		    DO_BO_UNPACK_P(aptr);
-		    s += sizeof(char*);
+		    s += sizeof(aptr);
 		}
+		DO_BO_UNPACK_P(aptr);
 		/* newSVpv generates undef if aptr is NULL */
 		PUSHs(sv_2mortal(newSVpv(aptr, 0)));
 	    }
 	    break;
 	case 'w':
-            if (len && unpack_only_one)
-                len = 1;
+            if (len && unpack_only_one) len = 1;
 	    EXTEND(SP, len);
 	    EXTEND_MORTAL(len);
 	    {
 		UV auv = 0;
 		U32 bytes = 0;
 		
-		while ((len > 0) && (s < strend)) {
-		    auv = (auv << 7) | (*s & 0x7f);
+		while (len > 0 && s < strend) {
+		    unsigned char ch;
+		    ch = utf8 ? next_uni_byte(aTHX_ &s, strend, 'w') : 
+			*(unsigned char *)s++;
+		    auv = (auv << 7) | (ch & 0x7f);
 		    /* UTF8_IS_XXXXX not right here - using constant 0x80 */
-		    if ((U8)(*s++) < 0x80) {
+		    if (ch < 0x80) {
 			bytes = 0;
 			PUSHs(sv_2mortal(newSVuv(auv)));
 			len--;
 			auv = 0;
+			continue;
 		    }
-		    else if (++bytes >= sizeof(UV)) {	/* promote to string */
+		    if (++bytes >= sizeof(UV)) {	/* promote to string */
 			char *t;
 			STRLEN n_a;
 
 			sv = Perl_newSVpvf(aTHX_ "%.*"UVf, (int)TYPE_DIGITS(UV), auv);
 			while (s < strend) {
-			    sv = mul128(sv, (U8)(*s & 0x7f));
-			    if (!(*s++ & 0x80)) {
+			    ch = utf8 ? next_uni_byte(aTHX_ &s, strend, 'w') :
+				*(unsigned char *)s++;
+			    sv = mul128(sv, (U8)(ch & 0x7f));
+			    if (!(ch & 0x80)) {
 				bytes = 0;
 				break;
 			    }
@@ -1488,36 +1765,36 @@
 	    if (symptr->howlen == e_star)
 	        Perl_croak(aTHX_ "'P' must have an explicit size in unpack");
 	    EXTEND(SP, 1);
-	    if (sizeof(char*) > strend - s)
-		break;
-	    else {
+	    if (sizeof(char*) > strend - s) break;
+	    {
+		char *aptr;
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend, (char *) &aptr, 
+					sizeof(aptr))) break;
+		} else {
 		Copy(s, &aptr, 1, char*);
-		DO_BO_UNPACK_P(aptr);
-		s += sizeof(char*);
+		    s += sizeof(aptr);
 	    }
+		DO_BO_UNPACK_P(aptr);
 	    /* newSVpvn generates undef if aptr is NULL */
 	    PUSHs(sv_2mortal(newSVpvn(aptr, len)));
+	    }
 	    break;
 #ifdef HAS_QUAD
 	case 'q':
 	    along = (strend - s) / sizeof(Quad_t);
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
                 EXTEND(SP, len);
                 EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
-		assert (s + sizeof(Quad_t) <= strend);
-		Copy(s, &aquad, 1, Quad_t);
-		DO_BO_UNPACK(aquad, 64);
-		s += sizeof(Quad_t);
-		if (!checksum) {
-                    PUSHs(sv_2mortal((aquad >= IV_MIN && aquad <= IV_MAX) ?
+		Quad_t aquad;
+		COPYVAR(s, strend, utf8, aquad, 64);
+		if (!checksum)
+                    PUSHs(sv_2mortal(aquad >= IV_MIN && aquad <= IV_MAX ?
 				     newSViv((IV)aquad) : newSVnv((NV)aquad)));
-                }
 		else if (checksum > bits_in_uv)
 		    cdouble += (NV)aquad;
 		else
@@ -1526,23 +1803,18 @@
 	    break;
 	case 'Q':
 	    along = (strend - s) / sizeof(Uquad_t);
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
                 EXTEND(SP, len);
                 EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
-		assert (s + sizeof(Uquad_t) <= strend);
-		Copy(s, &auquad, 1, Uquad_t);
-		DO_BO_UNPACK(auquad, 64);
-		s += sizeof(Uquad_t);
-		if (!checksum) {
-		    PUSHs(sv_2mortal((auquad <= UV_MAX) ?
-				     newSVuv((UV)auquad) : newSVnv((NV)auquad)));
-		}
+		Uquad_t auquad;
+		COPYVAR(s, strend, utf8, auquad, 64);
+		if (!checksum)
+		    PUSHs(sv_2mortal(auquad <= UV_MAX ?
+				     newSVuv((UV)auquad):newSVnv((NV)auquad)));
 		else if (checksum > bits_in_uv)
 		    cdouble += (NV)auquad;
 		else
@@ -1553,90 +1825,99 @@
 	/* float and double added gnb@melba.bby.oz.au 22/11/89 */
 	case 'f':
 	    along = (strend - s) / sizeof(float);
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
+		float afloat;
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend, (char *) &afloat, 
+					sizeof(afloat))) break;
+		} else {
 		Copy(s, &afloat, 1, float);
-		DO_BO_UNPACK_N(afloat, float);
 		s += sizeof(float);
-		if (!checksum) {
-		    PUSHs(sv_2mortal(newSVnv((NV)afloat)));
 		}
-		else {
+		DO_BO_UNPACK_N(afloat, float);
+		if (!checksum)
+		    PUSHs(sv_2mortal(newSVnv((NV)afloat)));
+		else
 		    cdouble += afloat;
 		}
-	    }
 	    break;
 	case 'd':
 	    along = (strend - s) / sizeof(double);
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
+		double adouble;
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend, (char *) &adouble, 
+					sizeof(adouble))) break;
+		} else {
 		Copy(s, &adouble, 1, double);
-		DO_BO_UNPACK_N(adouble, double);
 		s += sizeof(double);
-		if (!checksum) {
-		    PUSHs(sv_2mortal(newSVnv((NV)adouble)));
 		}
-		else {
+		DO_BO_UNPACK_N(adouble, double);
+		if (!checksum)
+		    PUSHs(sv_2mortal(newSVnv((NV)adouble)));
+		else
 		    cdouble += adouble;
 		}
-	    }
 	    break;
 	case 'F':
 	    along = (strend - s) / NVSIZE;
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
+		NV anv;
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend,
+					(char *) &anv, sizeof(anv))) break;
+		} else {
 		Copy(s, &anv, 1, NV);
-		DO_BO_UNPACK_N(anv, NV);
 		s += NVSIZE;
-		if (!checksum) {
-		    PUSHs(sv_2mortal(newSVnv(anv)));
 		}
-		else {
+		DO_BO_UNPACK_N(anv, NV);
+		if (!checksum)
+		    PUSHs(sv_2mortal(newSVnv(anv)));
+		else
 		    cdouble += anv;
 		}
-	    }
 	    break;
 #if defined(HAS_LONG_DOUBLE) && defined(USE_LONG_DOUBLE)
 	case 'D':
 	    along = (strend - s) / LONG_DOUBLESIZE;
-	    if (len > along)
-		len = along;
+	    if (len > along) len = along;
 	    if (!checksum) {
-                if (len && unpack_only_one)
-                    len = 1;
+                if (len && unpack_only_one) len = 1;
 		EXTEND(SP, len);
 		EXTEND_MORTAL(len);
 	    }
 	    while (len-- > 0) {
+		long double aldouble;
+		if (utf8) {
+		    if (!next_uni_bytes(aTHX_ &s, strend, (char *) &aldouble, 
+					sizeof(aldouble))) break;
+		} else {
 		Copy(s, &aldouble, 1, long double);
-		DO_BO_UNPACK_N(aldouble, long double);
 		s += LONG_DOUBLESIZE;
-		if (!checksum) {
-		    PUSHs(sv_2mortal(newSVnv((NV)aldouble)));
-		}
-		else {cdouble += aldouble;
 		}
+		DO_BO_UNPACK_N(aldouble, long double);
+		if (!checksum)
+		    PUSHs(sv_2mortal(newSVnv((NV)aldouble)));
+		else
+		    cdouble += aldouble;
 	    }
 	    break;
 #endif
@@ -1660,8 +1941,34 @@
 
 	    along = (strend - s) * 3 / 4;
 	    sv = NEWSV(42, along);
-	    if (along)
-		SvPOK_on(sv);
+	    if (along) SvPOK_on(sv);
+	    if (utf8) {
+		while (next_uni_uu(aTHX_ &s, strend, &len)) {
+		    I32 a, b, c, d;
+		    char hunk[4];
+
+		    hunk[3] = '\0';
+		    while (len > 0) {
+			next_uni_uu(aTHX_ &s, strend, &a);
+			next_uni_uu(aTHX_ &s, strend, &b);
+			next_uni_uu(aTHX_ &s, strend, &c);
+			next_uni_uu(aTHX_ &s, strend, &d);
+			hunk[0] = (char)((a << 2) | (b >> 4));
+			hunk[1] = (char)((b << 4) | (c >> 2));
+			hunk[2] = (char)((c << 6) | d);
+			sv_catpvn(sv, hunk, (len > 3) ? 3 : len);
+			len -= 3;
+		    }
+		    if (s < strend) {
+			if (*s == '\n') s++;
+			else {
+			    /* possible checksum byte */
+			    char *skip = s+UTF8SKIP(s);
+			    if (skip < strend && *skip == '\n') s = skip+1;
+			}
+		    }
+		}
+	    } else {
 	    while (s < strend && *s > ' ' && ISUUCHAR(*s)) {
 		I32 a, b, c, d;
 		char hunk[4];
@@ -1697,6 +2004,7 @@
 		    if (s + 1 < strend && s[1] == '\n')
 		        s += 2;
 	    }
+	    }
 	    XPUSHs(sv_2mortal(sv));
 	    break;
 	}
@@ -1704,20 +2012,18 @@
 	if (checksum) {
 	    if (strchr("fFdD", TYPE_NO_MODIFIERS(datumtype)) ||
 	      (checksum > bits_in_uv &&
-	       strchr("csSiIlLnNUvVqQjJ", TYPE_NO_MODIFIERS(datumtype))) ) {
-		NV trouble;
+	       strchr("cCsSiIlLnNUvVqQjJ", TYPE_NO_MODIFIERS(datumtype)))) {
+		NV trouble, anv;
 
-                adouble = (NV) (1 << (checksum & 15));
+                anv = (NV) (1 << (checksum & 15));
 		while (checksum >= 16) {
 		    checksum -= 16;
-		    adouble *= 65536.0;
+		    anv *= 65536.0;
 		}
-		while (cdouble < 0.0)
-		    cdouble += adouble;
-		cdouble = Perl_modf(cdouble / adouble, &trouble) * adouble;
+		while (cdouble < 0.0) cdouble += anv;
+		cdouble = Perl_modf(cdouble / anv, &trouble) * anv;
 		sv = newSVnv(cdouble);
-	    }
-	    else {
+	    } else {
 		if (checksum < bits_in_uv) {
 		    UV mask = ((UV)1 << checksum) - 1;
 		    cuv &= mask;
@@ -1728,31 +2034,30 @@
 	    checksum = 0;
 	}
     
-        if (symptr->flags & FLAG_SLASH){
+        if (symptr->flags & FLAG_SLASH) {
             if (SP - PL_stack_base - start_sp_offset <= 0)
                 Perl_croak(aTHX_ "'/' must follow a numeric type in unpack");
-            if( next_symbol(symptr) ){
-              if( symptr->howlen == e_number )
-		Perl_croak(aTHX_ "Count after length/code in unpack" );
-              if( beyond ){
+            if (next_symbol(symptr)) {
+		if (symptr->howlen == e_number)
+		    Perl_croak(aTHX_ "Count after length/code in unpack");
+		if (beyond) {
          	/* ...end of char buffer then no decent length available */
-		Perl_croak(aTHX_ "length/code after end of string in unpack" );
+		    Perl_croak(aTHX_ "length/code after end of string in unpack");
               } else {
          	/* take top of stack (hope it's numeric) */
                 len = POPi;
-                if( len < 0 )
-                    Perl_croak(aTHX_ "Negative '/' count in unpack" );
-              }
-            } else {
-		Perl_croak(aTHX_ "Code missing after '/' in unpack" );
+		    if (len < 0)
+			Perl_croak(aTHX_ "Negative '/' count in unpack");
             }
+            } else
+		Perl_croak(aTHX_ "Code missing after '/' in unpack");
             datumtype = symptr->code;
+	    literal = 0;
 	    goto redo_switch;
         }
     }
 
-    if (new_s)
-	*new_s = s;
+    if (new_s) *new_s = s;
     PUTBACK;
     return SP - PL_stack_base - start_sp_offset;
 }
@@ -1764,32 +2069,24 @@
     I32 gimme = GIMME_V;
     STRLEN llen;
     STRLEN rlen;
-    register char *pat = SvPV(left, llen);
-#ifdef PACKED_IS_OCTETS
-    /* Packed side is assumed to be octets - so force downgrade if it
-       has been UTF-8 encoded by accident
-     */
-    register char *s = SvPVbyte(right, rlen);
-#else
-    register char *s = SvPV(right, rlen);
-#endif
+    char *pat = SvPV(left,  llen);
+    char *s   = SvPV(right, rlen);
     char *strend = s + rlen;
-    register char *patend = pat + llen;
-    register I32 cnt;
+    char *patend = pat + llen;
+    I32 cnt;
 
     PUTBACK;
     cnt = unpackstring(pat, patend, s, strend,
-		     ((gimme == G_SCALAR) ? FLAG_UNPACK_ONLY_ONE : 0)
-		     | (DO_UTF8(right) ? FLAG_UNPACK_DO_UTF8 : 0));
+		       (gimme == G_SCALAR ? FLAG_UNPACK_ONLY_ONE : 0) |
+		       (DO_UTF8(right)    ? FLAG_UNPACK_DO_UTF8  : 0));
 
     SPAGAIN;
-    if ( !cnt && gimme == G_SCALAR )
-       PUSHs(&PL_sv_undef);
+    if (!cnt && gimme == G_SCALAR) PUSHs(&PL_sv_undef);
     RETURN;
 }
 
 STATIC void
-S_doencodes(pTHX_ register SV *sv, register char *s, register I32 len)
+S_doencodes(pTHX_ SV *sv, char *s, I32 len)
 {
     char hunk[5];
 
@@ -1904,14 +2201,14 @@
 
 
 void
-Perl_pack_cat(pTHX_ SV *cat, char *pat, register char *patend, register SV **beglist, SV **endlist, SV ***next_in_list, U32 flags)
+Perl_pack_cat(pTHX_ SV *cat, char *pat, char *patend, SV **beglist, SV **endlist, SV ***next_in_list, U32 flags)
 {
     tempsym_t sym = { 0 };
     sym.patptr = pat;
     sym.patend = patend;
     sym.flags  = FLAG_PACK;
 
-    (void)pack_rec( cat, &sym, beglist, endlist );
+    (void)pack_rec( cat, &sym, beglist, endlist);
 }
 
 
@@ -1924,9 +2221,10 @@
 
 
 void
-Perl_packlist(pTHX_ SV *cat, char *pat, register char *patend, register SV **beglist, SV **endlist )
+Perl_packlist(pTHX_ SV *cat, char *pat, char *patend, SV **beglist, SV **endlist)
 {
     tempsym_t sym = { 0 };
+
     sym.patptr = pat;
     sym.patend = patend;
     sym.flags  = FLAG_PACK;
@@ -1931,17 +2229,17 @@
     sym.patend = patend;
     sym.flags  = FLAG_PACK;
 
-    (void)pack_rec( cat, &sym, beglist, endlist );
+    (void) pack_rec(cat, &sym, beglist, endlist);
 }
 
 
 STATIC
 SV **
-S_pack_rec(pTHX_ SV *cat, register tempsym_t* symptr, register SV **beglist, SV **endlist )
+S_pack_rec(pTHX_ SV *cat, tempsym_t* symptr, SV **beglist, SV **endlist)
 {
-    register I32 items;
+    I32 items;
     STRLEN fromlen;
-    register I32 len = 0;
+    I32 len = 0;
     SV *fromstr;
     /*SUPPRESS 442*/
     static char null10[] = {0,0,0,0,0,0,0,0,0,0};
@@ -1982,10 +2280,10 @@
     tempsym_t lookahead;
 
     items = endlist - beglist;
-    found = next_symbol( symptr );
+    found = next_symbol(symptr);
 
 #ifndef PACKED_IS_OCTETS
-    if (symptr->level == 0 && found && symptr->code == 'U' ){
+    if (symptr->level == 0 && found && symptr->code == 'U') {
 	SvUTF8_on(cat);
     }
 #endif
@@ -1997,7 +2295,7 @@
         I32 datumtype = symptr->code;
         howlen_t howlen;
 
-        switch( howlen = symptr->howlen ){
+        switch (howlen = symptr->howlen) {
         case e_no_len:
 	case e_number:
 	    len = symptr->length;
@@ -2010,10 +2308,10 @@
         /* Look ahead for next symbol. Do we have code/code? */
         lookahead = *symptr;
         found = next_symbol(&lookahead);
-	if ( symptr->flags & FLAG_SLASH ) {
-	    if (found){
- 	        if ( 0 == strchr( "aAZ", lookahead.code ) ||
-                     e_star != lookahead.howlen )
+	if ( symptr->flags & FLAG_SLASH) {
+	    if (found) {
+ 	        if ( 0 == strchr( "aAZ", lookahead.code) ||
+                     e_star != lookahead.howlen)
  		    Perl_croak(aTHX_ "'/' must be followed by 'a*', 'A*' or 'Z*' in pack");
 	        lengthcode = sv_2mortal(newSViv(sv_len(items > 0
 						   ? *beglist : &PL_sv_no)
@@ -2045,7 +2343,7 @@
             symptr->level++;
 	    while (len--) {
   	        symptr->patptr = savsym.grpbeg;
-		beglist = pack_rec(cat, symptr, beglist, endlist );
+		beglist = pack_rec(cat, symptr, beglist, endlist);
 		if (savsym.howlen == e_star && beglist == endlist)
 		    break;		/* No way to continue */
 	    }
@@ -2121,7 +2419,7 @@
 	case 'B':
 	case 'b':
 	    {
-		register char *str;
+		char *str;
 		I32 saveitems;
 
 		fromstr = NEXTFROM;
@@ -2177,7 +2475,7 @@
 	case 'H':
 	case 'h':
 	    {
-		register char *str;
+		char *str;
 		I32 saveitems;
 
 		fromstr = NEXTFROM;
@@ -2249,7 +2547,7 @@
 		    if ((aint < -128 || aint > 127) &&
 			ckWARN(WARN_PACK))
 		        Perl_warner(aTHX_ packWARN(WARN_PACK),
-				    "Character in 'c' format wrapped in pack" );
+				    "Character in 'c' format wrapped in pack");
 		    achar = aint & 255;
 		    sv_catpvn(cat, &achar, sizeof(char));
 		    break;
@@ -2722,10 +3021,10 @@
 PP(pp_pack)
 {
     dSP; dMARK; dORIGMARK; dTARGET;
-    register SV *cat = TARG;
+    SV *cat = TARG;
     STRLEN fromlen;
-    register char *pat = SvPVx(*++MARK, fromlen);
-    register char *patend = pat + fromlen;
+    char *pat = SvPVx(*++MARK, fromlen);
+    char *patend = pat + fromlen;
 
     MARK++;
     sv_setpvn(cat, "", 0);
diff -rubB perl-clean/t/op/pack.t perl-dev/t/op/pack.t
--- perl-clean/t/op/pack.t	Fri Jan 14 21:23:23 2005
+++ perl-dev/t/op/pack.t	Sat Jan 22 01:21:58 2005
@@ -835,7 +834,7 @@
 isnt(v1.20.300.4000, sprintf "%vd", pack("C0U*",1,20,300,4000));
 
 my $rslt = $Is_EBCDIC ? "156 67" : "199 162";
-is(join(" ", unpack("C*", chr(0x1e2))), $rslt);
+is(join(" ", unpack("U0C*", chr(0x1e2))), $rslt);
 
 # does pack U create Unicode?
 is(ord(pack('U', 300)), 300);
@@ -853,8 +852,8 @@
 SKIP: {
     skip "Not for EBCDIC", 4 if $Is_EBCDIC;
 
-    # does unpack C unravel pack U?
-    is("@{[unpack('C*', pack('U*', 100, 200))]}", "100 195 136");
+    # does unpack U0C unravel pack U?
+    is("@{[unpack('U0C*', pack('U*', 100, 200))]}", "100 195 136");
 
     # does pack U0C create Unicode?
     is("@{[pack('U0C*', 100, 195, 136)]}", v100.v200);
@@ -864,8 +863,9 @@
 
     # does unpack U0U on byte data warn?
     {
+        my $bad = pack("U0C", 255);
         local $SIG{__WARN__} = sub { $@ = "@_" };
-        my @null = unpack('U0U', chr(255));
+        my @null = unpack('U0U', $bad);
         like($@, qr/^Malformed UTF-8 character /);
     }
 }
diff -rubB perl-clean/t/uni/case.pl perl-dev/t/uni/case.pl
--- perl-clean/t/uni/case.pl	Fri Jan 21 22:44:38 2005
+++ perl-dev/t/uni/case.pl	Fri Jan 21 22:46:17 2005
@@ -64,7 +64,7 @@
 
     for my $i (sort keys %$spec) {
 	my $w = unidump($spec->{$i});
-	my $u = unpack "U0U", $i;
+	my $u = unpack "C0U", $i;
 	my $h = sprintf "%04X", $u;
 	my $c = chr($u); $c .= chr(0x100); chop $c;
 	my $d = $func->($c);

Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About