Front page | perl.perl5.porters
Postings from October 2015
RE: Any objections to warning on chr(-1)
From:
Karl Williamson
Date:
October 29, 2015 05:52
Subject:
RE: Any objections to warning on chr(-1)
Message ID:
5631B40A.3040205@khwilliamson.com
On June 20, 2011 13:04, Chris Hall wrote:
> Karl Williamson wrote (on Sun 26-June-2011 at 01:24 +0100):
> ....
> > I would not stand in somebody's way if they wanted to do this, but
> > I'm not willing to do all the work entailed, such as proving that
> > nobody has encoded a second byte to 0xFF that is not 0x80. One
> > complication is that in UTF-EBCDIC, that would happen around 2**31,
> > I think. I actually don't think there are any EBCDIC machines out
> > there running modern Perl in native locales, but Perl officially is
> > supposed to support them.
>
> While you are tidying up this area (so that it can actually be said to
> work), I think it would be a shame to leave the issue of "what is a Perl
> Character" ambiguous.
>
> Clearly the cheapest of all approaches is to declare the effective limit
> to be machine specific, but with outer limit of 72-bits for all time;
> and live with the existing 0xFF encoding for ever.
>
> ---------------------------------------
>
> Since I know nothing about how this is implemented, I can offer the
> following expert opinion :-)
>
> Assuming that all code that creates and reads the Perl-Extended-UTF-8,
> works from/to local machine integer, then the current limit must be
> 64-bits ? If so, one can assert that the first byte after the 0xFF on
> all sequences written/readable by Perl to date *must* be 0x80.
>
> Least work, but extensible, approach would then be to:
>
> a. assert that any value of the byte after 0xFF other
> than 0x80 is now *reserved* for future extension.
>
> Could define 64 bit unsigned to be the (current)
> outer limit, which implicitly limits the valid
> values for this byte (and also the ms 2 value
> bits of the next byte).
>
> b. throw a suitable invalid encoding wobbly if any
> (now) reserved value is read.
>
> I assume it already throws some wobbly if the
> value doesn't fit in a machine integer.
>
> ...so this is largely "definition engineering".
>
> At some time in the future, iff it is found necessary to go beyond 64
> bits, can then implement new-fangled sequences, with the byte(s) after
> the 0xFF as a count.
>
> The new-fangled sequences could provide shorter encodings for 37-60 bit
> values. However, that messes up string comparison: old-style 0xFF 0x80
> sequences do not sort correctly against new-fangled 0xFF 0x80+N ones.
> That is solvable, but requires special case handling -- including the
> need to know the index of the first mis-matching byte -- all of which
> is only required for the most remote of fringe cases :-(
>
> IMHO anything beyond 31 bits is "exotic", so I find it difficult to give
> (damn > 0) how wasteful the encoding is ! So, sticking with the
> current, fixed length, 13 byte sequences for 37..66 bits is the
> straightforward solution.
>
> ---------------------------------------
>
> Nevertheless, the current 0xFF encoding does seem klunky :-( If there
> is any general -- beyond Perl "internal" use -- requirement to extend
> UTF-8 beyond 31 or 36 bits, then an encoding limited by an
> ("interesting") early design decision in Perl is unlikely to find favour
> elsewhere.
>
> So... it would be cleaner to legislate current 0xFF out of existence,
> and require anyone who (general broken-ness notwithstanding) has 0xFF
> sequences in files, to convert those files. Rationale: values > 31 bits
> have never worked terribly well (let alone > 36); it is now fixed, for
> now and into the future; BUT if you have actually managed to use 0xFF
> sequence and store those in files, then here is how to convert same
> (sorry). IMHO, the number of people who would be caught by this is
> trivial -- but I can think of no way of verifying that.
>
> As an intermediate step, could now set a default limit of 36 bits -- so
> that no 0xFF sequences are valid any more -- but provide a switch to
> override that and use current 0xFF sequences. Again, given the general
> broken-ness, this is not a big change, and is not irreversible. Anyone
> needing the override could then shout -- and it would become clear how
> many people would be troubled by a later complete withdrawal of current
> 0xFF sequences, and the introduction of new-fangled 0xFF sequences (for
> which 0xFF 0x80 would be the start of a 9 byte sequence for values
> 37..42 bits).
>
> Chris
>
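Chris's assertion above -- that any value fitting in a 64-bit integer, written in the current 13-byte 0xFF scheme, must have 0x80 as the byte after the 0xFF -- can be checked with a small sketch. This assumes the scheme as described in the thread (one start byte plus 12 continuation bytes carrying 6 payload bits each, most significant first); encode_ff is an illustrative name, not Perl source:

```c
#include <assert.h>
#include <stdint.h>

/* Encode a 64-bit value as a 13-byte 0xFF-start sequence: the start
 * byte, then 12 continuation bytes of 6 payload bits each, most
 * significant first.  A sketch of the scheme under discussion, not
 * Perl's actual code. */
static void encode_ff(uint64_t v, unsigned char out[13]) {
    out[0] = 0xFF;
    for (int i = 0; i < 12; i++) {
        unsigned shift = 6 * (11 - i);
        /* shifts of 64+ would be undefined behavior on a 64-bit type;
         * those payload bits are necessarily zero for a 64-bit value */
        unsigned char payload =
            (shift >= 64) ? 0 : (unsigned char)((v >> shift) & 0x3F);
        out[1 + i] = 0x80 | payload;
    }
}
```

Since the first continuation byte holds payload bits 66..71, and no 64-bit value has any of those bits set, that byte is always 0x80 -- which is exactly the claim.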
I have now investigated this further, and I suspect that the reason a
variable-length scheme was not used initially is that it would
introduce a branch into a very commonly used construct. UTF8SKIP() is a
macro that tells how many bytes the next character occupies. It is
implemented as a simple lookup into a 256-byte const array. That array
would almost always be in the cache when doing UTF-8 processing, as it
is used all over the place.
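A minimal sketch of the idea in C (illustrative names, with the table built at runtime for brevity; the 7- and 13-byte lengths for 0xFE and 0xFF follow the description in this thread, not Perl's actual source):

```c
#include <assert.h>

/* A UTF8SKIP-style lookup: one 256-entry table, indexed by the first
 * byte of a character, giving the byte length of the sequence it
 * starts.  No branch per character once the table is hot in cache. */
static unsigned char skip_table[256];

static void init_skip_table(void) {
    for (int b = 0; b < 256; b++) {
        if (b < 0xC0)       skip_table[b] = 1;  /* ASCII; stray continuations */
        else if (b < 0xE0)  skip_table[b] = 2;
        else if (b < 0xF0)  skip_table[b] = 3;
        else if (b < 0xF8)  skip_table[b] = 4;
        else if (b < 0xFC)  skip_table[b] = 5;
        else if (b < 0xFE)  skip_table[b] = 6;
        else if (b == 0xFE) skip_table[b] = 7;  /* up to 36 payload bits */
        else                skip_table[b] = 13; /* 0xFF: the extension */
    }
}

/* The whole "how long is this character" operation is one array load. */
#define UTF8SKIP_SKETCH(s) (skip_table[(unsigned char)*(s)])
```

A variable-length 0xFF scheme would instead need a second lookahead byte, and hence a branch, at exactly this spot.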
I don't know what the effect nowadays would be of adding a rarely-taken
branch to this code. Probably the size gain would not be noticeable,
and branch prediction would learn that the branch isn't taken, or the
lookup could be inlined. But it's still my guess as to why this was
done this way originally.
As to why it's sized to allow a 72-bit code point, I don't know. It
makes some sense getting to 64, to accommodate 64-bit systems; that
would take 12 UTF-8 bytes instead of 13. At 13 bytes, one start plus 12
data bytes, the payload is doubled, so that may have some bearing, but
I don't know what. Doing so, though, does mean that there are fewer
overlongs than otherwise. But whether that is a consideration they
cared about, or even thought about, I don't know.
Changing things now introduces backwards compatibility issues. However,
I don't think this should be of real concern. I don't think such high
code points are used very much at all, and there is a default-off
warning raised whenever outputting a code point above Unicode. There
could for a time be a stronger, default-on, warning raised for these
very large code points.
I am not advocating for this change, though I could be persuaded it is
worthwhile. I do think the pods should be changed to say that we
reserve the right to change the representation that gets written out
for very large code points, those having 0xFE and 0xFF start bytes. And
it might be a good idea to raise a different warning when outputting
these than the run-of-the-mill above-Unicode one.
Each added byte adds 6 bits of information.
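As a quick sanity check of that arithmetic (a sketch; ff_payload_bits is an illustrative name): with an 0xFF start byte, every one of the remaining bytes is pure payload at 6 bits apiece.

```c
#include <assert.h>

/* Payload capacity of an 0xFF-start extended-UTF-8 sequence: the start
 * byte contributes no payload bits; each of the n-1 continuation bytes
 * contributes 6. */
static int ff_payload_bits(int total_bytes) {
    return 6 * (total_bytes - 1);
}
```

So 13 total bytes gives 72 bits, 12 gives 66 (enough for 64-bit values), and a 7-byte 0xFE sequence tops out at 36.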
UTF-EBCDIC runs out of code point space at 2**31-1 without using the
trick that UTF-8 uses to get above 2**36. But there are now 64-bit
EBCDIC platforms, so I'm going to change our implementation of
UTF-EBCDIC to use the trick. A total of 14 bytes gets it to 2**64 (as
opposed to 13 total for UTF-8 to get to 2**72).
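The same arithmetic for UTF-EBCDIC, assuming its continuation bytes carry only 5 payload bits each (which is why it needs an extra byte; illustrative code, not Perl source):

```c
#include <assert.h>

/* UTF-EBCDIC continuation bytes carry 5 payload bits apiece, so an
 * n-byte 0xFF-start sequence carries 5*(n-1) bits.  13 bytes (60 bits)
 * falls short of a 64-bit value; 14 bytes (65 bits) covers it. */
static int ebcdic_ff_payload_bits(int total_bytes) {
    return 5 * (total_bytes - 1);
}
```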