develooper Front page | perl.perl5.porters | Postings from November 2008

[perl #58182] Unicode bug: code review request

Thread Next
karl williamson
November 22, 2008 14:25
[perl #58182] Unicode bug: code review request
Message ID:
Attached for code review are changes to pp.c to enable case handling of 
characters in the range 128-255.  This is not a patch yet.  Since I'm 
new to changing the perl source, I wanted to get feedback before 
submitting a real patch.  Also, changes in a couple of other files 
depend on some unresolved issues.

There are two attachments, one is in standard patch format.  The other 
is an html file containing another type of diff that I prefer.  In it, 
the cyan background is for deleted things, the yellow for added; changes 
in line indentation are not shown.

I'm wondering what sorts of things I might be overlooking.  I do know 
that I don't know much about how overloading or magic might affect these 
routines.  I don't think my changes would affect these, but then, I 
don't know much about these.

There are a couple of comments marked TODO, which means I have special 
questions about them.

One of my design goals was to not slow things down unnecessarily.  I 
think that, if anything, I have sped things up.  One area of concern I 
have, though, in that regard is in the loop in pp_uc.  I don't know 
anything about modern optimizers.  Before, it was a very tight loop, 
which an optimizer could reasonably handle.  Now, the mainstream is a 
tight loop, but there is a conditional in it which can cause a 
significant amount of code to be executed, that could fool the 
optimizer.  Perhaps the non-mainstream case should be put in a function.

I added some code to improve the efficiency of utf8 handling, so that if 
perl has hard-coded into it the upper and lower cases of a character, it 
doesn't have to go out to the general Unicode functions.

The most changes (as opposed to additions) are in uc_first().  Most of 
these come from rearranging the code so that in all cases the changed 
character is known before processing the rest of the string. 
Previously, only if it was encoded in utf-8 would this preliminary step 
be done.  Otherwise, it was done as it went along.

An earlier author contemplated combining lc and uc into one function.  I 
haven't done that, yet.  At the expense of two extra comparisons per 
function call, I could save, perhaps as much space in perl as I've used 
up by adding the code to do the new functionality.

I have added several code-generating macros.  Normally, I don't like 
these, but I think it makes things more readable here.

I have added many more comments than are typical in the Perl source.  I 
earlier got feedback that this might be a good thing.

If I don't get any feedback, I'll end up submitting a patch.  This does 
pass all regression tests.  With the new behavior enabled it fails one, 
which I have gotten permission from the author to change.

If you were to try compiling this, you would get missing symbol errors, 
from macros in headers.  Several of these should be obvious what they 
mean, but here are definitions of ones that may not be:

IN_UNI_8_BIT is true if and only if characters in the range 128-255 are 
to be treated as having upper and lower case as defined by Unicode. 
When false, these routines should deliver identical results as they 
always have.

toLOWER_LATIN1(c) takes a character in the range 0-255 and returns its 
lower case as defined by Unicode.

toUPPER_LATIN1_MOD(c) takes a character in the range 0-255 and returns 
its upper case as defined by Unicode, except for 3 tricky characters, 
for which it returns a modified value, as explained in the code's comments.

UTF8_TWO_BYTE_HI(c) returns the first utf8-encoded byte for a Unicode 
character c whose utf8 is known to take exactly two bytes.  Similarly 

Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About