Front page | perl.perl5.porters |
Postings from November 2008
[perl #58182] Unicode bug: code review request
Thread Next
From:
karl williamson
Date:
November 22, 2008 14:25
Subject:
[perl #58182] Unicode bug: code review request
Message ID:
49288696.1060902@khwilliamson.com
Attached for code review are changes to pp.c to enable case handling of
characters in the range 128-255. This is not a patch yet. Since I'm
new to changing the perl source, I wanted to get feedback before
submitting a real patch. Also, changes in a couple of other files
depend on some unresolved issues.
There are two attachments, one is in standard patch format. The other
is an html file containing another type of diff that I prefer. In it,
the cyan background is for deleted things, the yellow for added; changes
in line indentation are not shown.
I'm wondering what sorts of things I might be overlooking. I do know
that I don't know much about how overloading or magic might affect these
routines. I don't think my changes would affect these, but then, I
don't know much about these.
There are a couple of comments marked TODO, which means I have special
questions about them.
One of my design goals was to not slow things down unnecessarily. I
think that, if anything, I have sped things up. One area of concern I
have, though, in that regard is in the loop in pp_uc. I don't know
anything about modern optimizers. Before, it was a very tight loop,
which an optimizer could reasonably handle. Now, the mainstream is a
tight loop, but there is a conditional in it which can cause a
significant amount of code to be executed, that could fool the
optimizer. Perhaps the non-mainstream case should be put in a function.
I added some code to improve the efficiency of utf8 handling, so that if
perl has hard-coded into it the upper and lower cases of a character, it
doesn't have to go out to the general Unicode functions.
The most changes (as opposed to additions) are in uc_first(). Most of
these come from rearranging the code so that in all cases the changed
character is known before processing the rest of the string.
Previously, only if it was encoded in utf-8 would this preliminary step
be done. Otherwise, it was done as it went along.
An earlier author contemplated combining lc and uc into one function. I
haven't done that, yet. At the expense of two extra comparisons per
function call, I could save, perhaps as much space in perl as I've used
up by adding the code to do the new functionality.
I have added several code-generating macros. Normally, I don't like
these, but I think it makes things more readable here.
I have added many more comments than are typical in the Perl source. I
earlier got feedback that this might be a good thing.
If I don't get any feedback, I'll end up submitting a patch. This does
pass all regression tests. With the new behavior enabled it fails one,
which I have gotten permission from the author to change.
If you were to try compiling this, you would get missing symbol errors,
from macros in headers. Several of these should be obvious what they
mean, but here are definitions of ones that may not be:
IN_UNI_8_BIT is true if and only if characters in the range 128-255 are
to be treated as having upper and lower case as defined by Unicode.
When false, these routines should deliver identical results as they
always have.
toLOWER_LATIN1(c) takes a character in the range 0-255 and returns its
lower case as defined by Unicode.
toUPPER_LATIN1_MOD(c) takes a character in the range 0-255 and returns
its upper case as defined by Unicode, except for 3 tricky characters,
for which it returns a modified value, as explained in the code's comments.
UTF8_TWO_BYTE_HI(c) returns the first utf8-encoded byte for a Unicode
character c whose utf8 is known to take exactly two bytes. Similarly
for UTF8_TWO_BYTE_LO
Thread Next
-
[perl #58182] Unicode bug: code review request
by karl williamson