develooper Front page | perl.perl5.porters | Postings from November 2008

Re: [perl #58182] Unicode bug: code review request

Thread Previous
karl williamson
November 22, 2008 15:14
Re: [perl #58182] Unicode bug: code review request
Message ID:
karl williamson wrote:
> Attached for code review are changes to pp.c to enable case handling of 
> characters in the range 128-255.  This is not a patch yet.  Since I'm 
> new to changing the perl source, I wanted to get feedback before 
> submitting a real patch.  Also, changes in a couple of other files 
> depend on some unresolved issues.
> There are two attachments, one is in standard patch format.  The other 
> is an html file containing another type of diff that I prefer.  In it, 
> the cyan background is for deleted things, the yellow for added; changes 
> in line indentation are not shown.
> I'm wondering what sorts of things I might be overlooking.  I do know 
> that I don't know much about how overloading or magic might affect these 
> routines.  I don't think my changes would affect these, but then, I 
> don't know much about these.
> There are a couple of comments marked TODO, which means I have special 
> questions about them.
> One of my design goals was to not slow things down unnecessarily.  I 
> think that, if anything, I have sped things up.  One area of concern I 
> have, though, in that regard is in the loop in pp_uc.  I don't know 
> anything about modern optimizers.  Before, it was a very tight loop, 
> which an optimizer could reasonably handle.  Now, the mainstream is a 
> tight loop, but there is a conditional in it which can cause a 
> significant amount of code to be executed, that could fool the 
> optimizer.  Perhaps the non-mainstream case should be put in a function.
> I added some code to improve the efficiency of utf8 handling, so that if 
> perl has hard-coded into it the upper and lower cases of a character, it 
> doesn't have to go out to the general Unicode functions.
> The most changes (as opposed to additions) are in uc_first().  Most of 
> these come from rearranging the code so that in all cases the changed 
> character is known before processing the rest of the string. Previously, 
> only if it was encoded in utf-8 would this preliminary step be done.  
> Otherwise, it was done as it went along.
> An earlier author contemplated combining lc and uc into one function.  I 
> haven't done that, yet.  At the expense of two extra comparisons per 
> function call, I could save, perhaps as much space in perl as I've used 
> up by adding the code to do the new functionality.
> I have added several code-generating macros.  Normally, I don't like 
> these, but I think it makes things more readable here.
> I have added many more comments than are typical in the Perl source.  I 
> earlier got feedback that this might be a good thing.
> If I don't get any feedback, I'll end up submitting a patch.  This does 
> pass all regression tests.  With the new behavior enabled it fails one, 
> which I have gotten permission from the author to change.
> If you were to try compiling this, you would get missing symbol errors, 
> from macros in headers.  Several of these should be obvious what they 
> mean, but here are definitions of ones that may not be:
> IN_UNI_8_BIT is true if and only if characters in the range 128-255 are 
> to be treated as having upper and lower case as defined by Unicode. When 
> false, these routines should deliver identical results as they always have.
> toLOWER_LATIN1(c) takes a character in the range 0-255 and returns its 
> lower case as defined by Unicode.
> toUPPER_LATIN1_MOD(c) takes a character in the range 0-255 and returns 
> its upper case as defined by Unicode, except for 3 tricky characters, 
> for which it returns a modified value, as explained in the code's comments.
> UTF8_TWO_BYTE_HI(c) returns the first utf8-encoded byte for a Unicode 
> character c whose utf8 is known to take exactly two bytes.  Similarly 

The email that got through the filters stripped off my html file, I 
presume for security reasons.

Thread Previous Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About