develooper Front page | perl.perl5.porters | Postings from April 2008

Re: [perl #51710] utf8::valid rejects characters in \x14_FFFF - \x1F_FFFF

From: Glenn Linderman
Date: April 1, 2008 02:19
Subject: Re: [perl #51710] utf8::valid rejects characters in \x14_FFFF - \x1F_FFFF
Message ID: 47F1B263.8040805@nevcal.com

On approximately 3/31/2008 9:31 AM, came the following characters from 
the keyboard of Chris Hall:
> [FWIW, this is not the only place where characters > 0x7FFF_FFFF are 
> FUBAR.  I would ditch the attempt to support anything beyond 
> 0x7FFF_FFFF before the world gets much older !]

I'm somewhat ambivalent here... supporting 64 bits could be useful on 
64-bit platforms; however, if there is ever a 128-bit platform, there is 
no provision for further extension of the range of values.

The question is, what are the requirements? 

If it is for Unicode only, well, that was a moving target at 
implementation time.  Unicode was 16 bit then, and ISO 10646 was 32 bit, 
and no one knew which one would win.  Unicode seems to be a bit better 
defined now, but it also exceeds 16 bits now, so aren't we glad that we 
aren't locked into a 16-bit implementation?

If it is also for "v string" type things, that are not pure Unicode, but 
just lists of numbers, then where do you draw the line?  Musical 
metrical data has lots of numbers in a string, but they are generally 
small numbers, mostly < 20, so no problem.  IPv4 addresses are 4 bytes 
long, as are Windows version strings.  Dewey decimals are pretty easily 
stuck into strings also.  IPv6 addresses are longer, and the units are 
bigger.  I think they still fit.  What about a list of the number of 
words, or characters, in each paragraph of the works of Shakespeare (get 
those monkeys going on the typewriters)?  It gets long, but it is 
unlikely that the individual numbers would be too big for today's 72-bit 
range of values....

Now I have no clue what sort of "stringification" or "canonicalization" 
is done by Storable, but Data::Dumper converts everything to ASCII 
text.  For decimal numbers that's roughly 3.3 bits per character.  If it 
were hex, you get 4 bits per character.  Base 64 gets you 6 bits per 
character.  But you have to encode length also... "extended UTF-8" gets 
you 6 bits per character, and a dynamic length.  Could be pretty handy 
for that sort of application, if the bounds of the numbers are 
sufficient (and not bound by Unicode semantics).  But then the bounds of 
the "stringified numerics" would definitely have to cover the bounds of 
the numeric integers for the platform.
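
The per-character figures are just the base-2 logarithm of the alphabet 
size, so a decimal digit carries log2(10), about 3.32 bits.  A minimal 
sketch of that arithmetic, in Python only for convenience (the function 
name is made up for this illustration):

```python
import math

def bits_per_char(alphabet_size: int) -> float:
    """Information carried by one character drawn from an alphabet of this size."""
    return math.log2(alphabet_size)

for name, size in [("decimal", 10), ("hex", 16), ("base 64", 64)]:
    print(f"{name}: {bits_per_char(size):.2f} bits per character")
```

The 6 data bits in each continuation byte of "extended UTF-8" put it on 
a par with base 64, before counting the prefix byte.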

So, what are the requirements?

<imagine this> (or skip to the end of it)

Had I been consulted in the original design of Perl's "extended UTF-8", 
I'm not sure what I'd have suggested back then, but today, if the goal 
is to have extended character ranges, I would make a different trade-off.

Today's implementation (if I understand it correctly):

* is strictly UTF-8 encoding compliant for values 0-0x7FFF_FFFF
* uses a prefix byte of 0xFE to indicate 36-bit numbers? (6 more bytes 
containing 6 data bits each)
* uses a prefix byte of 0xFF to indicate 72-bit numbers? (12 more bytes 
containing 6 data bits each)
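
To make the size jump concrete, here is a rough sketch of the encoded 
length under the scheme as the three points above describe it.  It is 
Python rather than Perl, it models only the byte counts, and the 
function name is hypothetical; it is not taken from Perl's source.

```python
def current_length(value: int) -> int:
    """Total encoded bytes for `value`, per the three points listed above."""
    if value < 0:
        raise ValueError("only non-negative values are encodable")
    bits = value.bit_length()
    # Original (pre-RFC 3629) UTF-8: 1..6 bytes carry 7, 11, 16, 21, 26, 31 bits.
    for nbytes, limit in enumerate((7, 11, 16, 21, 26, 31), start=1):
        if bits <= limit:
            return nbytes
    if bits <= 36:
        return 1 + 6    # 0xFE prefix + 6 continuation bytes
    if bits <= 72:
        return 1 + 12   # 0xFF prefix + 12 continuation bytes
    raise ValueError("more than 72 bits does not fit")

print(current_length(0x7FFF_FFFF))  # last value in the standard 6-byte form
print(current_length(1 << 35))      # needs 36 bits: 0xFE form
print(current_length(1 << 36))      # needs 37 bits: 0xFF form
```

Under these assumptions the length goes 6, 7, 13 as the value crosses 
the 31-bit and 36-bit boundaries.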

Today, I would suggest a design that is different, handles larger ranges 
of numbers, and doesn't have such a huge jump in size...  Changing only 
the third point to:

* use prefix byte 0xFF to indicate that the next byte is a count of the 
additional data bytes following.  The count byte and every data byte 
carry 6 bits in their lower 6 bits, with bit 8 set and bit 7 reset (that 
is, they have the form 10xxxxxx, like any UTF-8 continuation byte).  
Since 0xFE already implies 6 more data bytes, one could get a little 
more range by not counting the 6 already accounted for, nor the count 
byte itself, nor the "at least one more" that meant the value wouldn't 
fit in 6 bytes.  The 6-bit count is then biased by 7, so 0xFF80 would 
imply 7 additional data bytes following, and 0xFFBF would imply 70 
additional data bytes following.

This would allow for up to 420 data bits.  It would jump from 7 bytes to 
9 bytes when the 37th bit is needed, but would then grow 1 byte at a 
time with increasing magnitudes, rather than having a 6 byte jump when 
the 37th bit is needed, and topping out at 72 bits.
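
A minimal sketch of this count-byte form, again in Python and only for 
illustration.  The function names are hypothetical, and putting the most 
significant 6 bits first is my assumption here (it matches ordinary 
UTF-8); the description above does not fix the byte order.

```python
def proposed_encode(value: int) -> bytes:
    """Encode a value needing 37..420 bits as 0xFF, count byte, data bytes."""
    bits = value.bit_length()
    if bits <= 36:
        raise ValueError("values up to 36 bits keep their existing encoding")
    n_data = (bits + 5) // 6              # data bytes needed, 6 bits each
    if n_data > 70:
        raise ValueError("more than 420 bits does not fit")
    count_byte = 0x80 | (n_data - 7)      # 0x80 means 7 data bytes, 0xBF means 70
    data = bytes(0x80 | ((value >> (6 * i)) & 0x3F)
                 for i in reversed(range(n_data)))   # assumed big-endian order
    return bytes([0xFF, count_byte]) + data

def proposed_decode(buf: bytes) -> int:
    """Inverse of proposed_encode; rejects malformed input."""
    if len(buf) < 2 or buf[0] != 0xFF or (buf[1] & 0xC0) != 0x80:
        raise ValueError("not a count-byte sequence")
    n_data = (buf[1] & 0x3F) + 7
    if len(buf) != 2 + n_data:
        raise ValueError("length does not match count byte")
    value = 0
    for b in buf[2:]:
        if (b & 0xC0) != 0x80:
            raise ValueError("data byte is not of the form 10xxxxxx")
        value = (value << 6) | (b & 0x3F)
    return value

encoded = proposed_encode(1 << 36)        # first value needing the 37th bit
print(len(encoded), encoded[:2].hex())
```

With these assumptions the first value past 36 bits takes 9 bytes and 
starts with 0xFF80, and each further 6 bits costs one more byte.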

With the current design, I have no idea why 72 bits were chosen, rather 
than 66, which would seem to cover 64 bit integers well, and only use 12 
bytes instead of that odd 13 byte value.

With my proposed alternate design, it would also be possible to 
interpret the number of additional bytes in other ways than one extra 
byte at a time... a simple scheme could suggest adding 2 bytes at a 
time, jumping from 7 to 10 bytes when the 37th bit is needed, and then 
to 12 when the 49th bit is needed.  This would give a range up to 804 
bits.  A more complex scheme could have numbers from 0-31 mean 2 more 
bytes each, but the numbers 32-63 would mean 4 more bytes each... 
trading off the number of extra bytes for bigger range.  Or a lookup 
table, with each bigger value adding even more bytes.
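
All of these variants amount to a 64-entry table mapping the 6-bit count 
to a number of data bytes.  A small sketch checking the figures above; 
the table and function names are made up for this illustration.

```python
def total_bytes(bits_needed: int, table: list) -> int:
    """Total encoded length: 0xFF + count byte + the smallest table entry that fits."""
    for n_data in table:                  # table is indexed by the 6-bit count
        if n_data * 6 >= bits_needed:
            return 2 + n_data
    raise ValueError("value too large for this table")

def max_bits(table: list) -> int:
    """Largest number of data bits the table can express."""
    return 6 * max(table)

one_at_a_time = [7 + n for n in range(64)]       # 7, 8, ..., 70 data bytes
two_at_a_time = [8 + 2 * n for n in range(64)]   # 8, 10, ..., 134 data bytes

print(max_bits(one_at_a_time), max_bits(two_at_a_time))
print(total_bytes(37, two_at_a_time), total_bytes(49, two_at_a_time))
```

A lookup table with uneven steps would simply be a different list passed 
to the same function.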

</imagine this>


Working within the current implementation, however, I mostly agree with 
the following proposal... an extra parameter to specify what kind of 
validity you are requesting.  Not clear if the current behavior is 
useful enough to be a default.

I wonder if an additional optional parameter, containing the Unicode 
version of interest, would be appropriate, or if it should just be 
declared that this particular version of Perl implements this particular 
version of Unicode...

> It seems to me that there's a place for an optional argument for 
> utf8::valid, bitwise:
>
>    * reject > 0x10_FFFF      ) strict UTF-8
>    * reject 'surrogates'     )
>
>    * reject 'non-characters' ) for strict UTF-8, external exchange
>
>    * reject > 0x7FFF_FFFF    ) to filter out Perl's non-standard stuff
>
> and one could make a case for:
>
>    * reject private use      ) for external exchange (planes 14 & 15)
>                              ) and U+E000..U+F8FF
>
>    * reject reserved         ) for tidy-ness... but this could be
>                              ) limited to the "large" reserved areas (?)
>
>    * reject U+FFFD           ) for external exchange
>
>    * reject controls         ) for the paranoid -- excluding whitespace
>
>    * reject > 0x00_FFFF      ) i.e. accept BMP (& priv. use)
>    * reject > 0x01_FFFF      ) i.e. accept BMP & SMP (& priv. use)
>    * reject > 0x02_FFFF      ) i.e. accept BMP, SMP & SIP (& priv. use)
>
> Something along those lines, anyway.  [Filtering by plane could be 
> made more general.]
>
> There are probably too many scripts to make it feasible to filter down 
> to that level, here.  Besides, an unvoiced requirement is for this to 
> be a quick "first cut" scan.
>
> Return value could be byte offset of first rejected object -- -1 => 
> OK. So not a true/false return.
>
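
A rough sketch of how such a bitwise argument and offset return could 
behave, covering only the first four checks.  It is Python, not Perl; 
the flag names, the function names, and the choice to take a list of 
code points rather than raw bytes are all assumptions for illustration, 
not an existing utf8::valid interface.

```python
from enum import IntFlag

class Reject(IntFlag):            # hypothetical flag names
    ABOVE_UNICODE = 1             # reject > 0x10_FFFF
    SURROGATES = 2                # reject U+D800..U+DFFF
    NONCHARACTERS = 4             # reject U+FDD0..U+FDEF and U+nFFFE/U+nFFFF
    ABOVE_31_BIT = 8              # reject > 0x7FFF_FFFF

def encoded_len(cp: int) -> int:
    """Byte length in original UTF-8, or the 0xFE/0xFF forms above 31 bits."""
    limits = (0x7F, 0x7FF, 0xFFFF, 0x1F_FFFF, 0x3FF_FFFF, 0x7FFF_FFFF)
    for nbytes, limit in enumerate(limits, start=1):
        if cp <= limit:
            return nbytes
    return 7 if cp < (1 << 36) else 13

def is_rejected(cp: int, flags: Reject) -> bool:
    if flags & Reject.ABOVE_31_BIT and cp > 0x7FFF_FFFF:
        return True
    if flags & Reject.ABOVE_UNICODE and cp > 0x10_FFFF:
        return True
    if flags & Reject.SURROGATES and 0xD800 <= cp <= 0xDFFF:
        return True
    if flags & Reject.NONCHARACTERS and cp <= 0x10_FFFF:
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            return True
    return False

def first_rejected_offset(code_points, flags: Reject) -> int:
    """Byte offset of the first rejected code point, or -1 if all pass."""
    offset = 0
    for cp in code_points:
        if is_rejected(cp, flags):
            return offset
        offset += encoded_len(cp)
    return -1

strict = Reject.ABOVE_UNICODE | Reject.SURROGATES
print(first_rejected_offset([0x41, 0xE9, 0xD800, 0x42], strict))
print(first_rejected_offset([0x41, 0x14_FFFF], Reject(0)))
```

With no flags set, a value such as 0x14_FFFF passes, which is the 
behaviour the original report asked for.
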
>> Thanks for the report.
>>
>> Steve
