
[PATCH 5.5.64] allow 64-bit utf8

From: Ilya Zakharevich
Date: February 6, 2000 10:18
Subject: [PATCH 5.5.64] allow 64-bit utf8
Message ID: 20000206131818.A20734@math.mps.ohio-state.edu

This patch extends the range of possible utf8 values to support 64-bit
integers (and some flags to make future expansion possible).

Explanation: *by definition*, UTF-8-encoded integers cannot start with
0xfe or 0xff (to allow byte-order marks?).  However, if the first byte
has bit 0x80 set, the second byte will never have bit 0x40 set, so
combinations 0xfe 0xff and 0xff 0xfe cannot appear in any
utf8-compatible string anyway.
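
A minimal sketch of the bit test behind this argument, in C. The helper
names is_continuation and has_bom_like_pair are hypothetical and are not
part of the patch; the only assumption is the usual 10xxxxxx form of a
continuation byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A continuation byte has the form 10xxxxxx:
 * bit 0x80 is set and bit 0x40 is clear. */
static bool is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical scanner: reports whether the adjacent pair FE FF or FF FE
 * occurs anywhere in buf.  In a well-formed sequence a byte with 0x80 set
 * in lead position must be followed by a continuation byte, and neither
 * 0xfe nor 0xff is one, so this should return false for such input. */
static bool has_bom_like_pair(const unsigned char *buf, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        if ((buf[i] == 0xFE && buf[i + 1] == 0xFF) ||
            (buf[i] == 0xFF && buf[i + 1] == 0xFE))
            return true;
    }
    return false;
}
```

Both 0xfe and 0xff have bit 0x40 set, so is_continuation rejects them,
which is all the argument above needs.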

Perl lifted this restriction, and so already allowed integers wider than
31 bits to be encoded as utf8 (this is why we were not using the name
UTF-8 ;-).  This patch changes the extension for the case where the first
byte is 0xff: such a sequence now carries 72 bits instead of the current
42 bits.
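
The bit counts can be checked with a little arithmetic: every byte after
the first contributes 6 payload bits.  The sketch below assumes that rule
and uses a hypothetical helper, payload_bits, which is not part of the
patch.

```c
#include <assert.h>

/* Payload bits in a sequence of `len` bytes whose lead byte contributes
 * `lead_bits` payload bits; each of the remaining bytes adds 6 bits. */
static unsigned payload_bits(unsigned len, unsigned lead_bits)
{
    return lead_bits + 6u * (len - 1u);
}
```

With a lead byte of 0xff (no payload bits of its own), payload_bits(8, 0)
gives the old 42 bits and payload_bits(13, 0) gives the new 72 bits; the
6-byte form, whose lead byte carries 1 bit, gives the 31 bits mentioned
above.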

On 32-bit machines the encoding stays the same (any integer was already
encodable).  With this patch any integer is encodable on 64-bit machines
too, with no penalty whatsoever if the extension is not used.

Also: utf8-to-uv had a bug when the first byte was 0xff: the bits of the
lead byte were not cleared from uv before the continuation bytes were
decoded.
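
For illustration only, here is a standalone sketch of the 13-byte form and
a matching decoder that starts from uv = 0.  The names encode13 and
decode13 are hypothetical, uint64_t stands in for UV, and the trailing
shifts by 18, 12, 6 and 0 are an assumption, since the diff context stops
at the shift by 24.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoder for the 13-byte form; returns bytes written. */
static int encode13(unsigned char *d, uint64_t uv)
{
    unsigned char *start = d;

    *d++ = 0xff;                                        /* lead byte */
    *d++ = 0x80;                                        /* 6 reserved bits */
    *d++ = (unsigned char)(((uv >> 60) & 0x0f) | 0x80); /* 2 reserved + top 4 */
    /* Ten more continuation bytes, 6 bits each: shifts 54, 48, ..., 6, 0. */
    for (int shift = 54; shift >= 0; shift -= 6)
        *d++ = (unsigned char)(((uv >> shift) & 0x3f) | 0x80);
    return (int)(d - start);
}

/* Hypothetical decoder: the lead byte 0xff carries no payload, so the
 * accumulator starts at 0.  Reserved bits shift out of the 64-bit
 * accumulator and are silently ignored in this sketch. */
static uint64_t decode13(const unsigned char *s)
{
    uint64_t uv = 0;

    for (int i = 1; i < 13; i++)
        uv = (uv << 6) | (uint64_t)(s[i] & 0x3f);
    return uv;
}
```

A round trip through encode13 and decode13 should return the original
value for any 64-bit input; a real decoder would also want to validate
the continuation bytes and the reserved bits.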

Enjoy,
Ilya

--- ./utf8.h-pre	Tue Feb  1 15:29:46 2000
+++ ./utf8.h	Sun Feb  6 12:25:08 2000
@@ -18,7 +18,8 @@ EXTCONST unsigned char PL_utf8skip[] = {
 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, /* bogus */
 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, /* bogus */
 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, /* scripts */
-3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,6,6,7,8, /* cjk etc. */
+3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,6,6,	 /* cjk etc. */
+7,13, /* Perl extended (not UTF-8).  Up to 72bit allowed (64-bit + reserved). */
 };
 #else
 EXTCONST unsigned char PL_utf8skip[];
--- ./utf8.c-pre	Thu Jan 27 17:53:28 2000
+++ ./utf8.c	Sun Feb  6 13:10:10 2000
@@ -84,6 +84,11 @@ Perl_uv_to_utf8(pTHX_ U8 *d, UV uv)
 #ifdef HAS_QUAD
     {
 	*d++ =                        0xff;	/* Can't match U+FFFE! */
+	*d++ =                        0x80;	/* 6 Reserved bits */
+	*d++ = (((uv >> 60) & 0x0f) | 0x80);	/* 2 Reserved bits */
+	*d++ = (((uv >> 54) & 0x3f) | 0x80);
+	*d++ = (((uv >> 48) & 0x3f) | 0x80);
+	*d++ = (((uv >> 42) & 0x3f) | 0x80);
 	*d++ = (((uv >> 36) & 0x3f) | 0x80);
 	*d++ = (((uv >> 30) & 0x3f) | 0x80);
 	*d++ = (((uv >> 24) & 0x3f) | 0x80);
@@ -120,8 +125,8 @@ Perl_utf8_to_uv(pTHX_ U8* s, I32* retlen
     else if (!(uv & 0x08))	{ len = 4; uv &= 0x07; }
     else if (!(uv & 0x04))	{ len = 5; uv &= 0x03; }
     else if (!(uv & 0x02))	{ len = 6; uv &= 0x01; }
-    else if (!(uv & 0x01))	{ len = 7; uv &= 0x00; }
-    else 			  len = 8;	/* whoa! */
+    else if (!(uv & 0x01))	{ len = 7;  uv = 0; }
+    else 			{ len = 13; uv = 0; } /* whoa! */
 
     if (retlen)
 	*retlen = len;
