
[perl #58182] Unicode bug: More questions about coding

karl williamson
November 18, 2008 17:55
I'm almost ready to submit my proposed changes for the uc(), lcfirst(), 
etc. functions for code review.  But I have several more questions.

These functions are all in pp.c.

Currently, within the scope of "use bytes", these functions treat the
data as strict ASCII and change the case accordingly.  Someone earlier
suggested that this is a bug: that this mode is really for binary data
only, and that the case should not change in it.  What should I do?

There are a couple of cases where a string has to be converted to utf8.
bytes_to_utf8() assumes the worst case, that the new string will occupy
2n+1 bytes, and allocates a new scalar of that size.  The code in these
functions, by contrast, checks on every pass through the
character-processing loop whether more space is needed, and if so grows
the scalar by just that amount.  (This happens only for Unicode, where
the worst case may be more than 2n.)  Which precedent should I follow
when the worst case is 2n?
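
To make the two allocation strategies concrete, here is a minimal sketch
in plain C, using malloc()/realloc() rather than Perl's scalar API.  Both
function names are hypothetical and neither is the real bytes_to_utf8()
or the code in pp.c; they only illustrate "allocate 2n+1 up front" versus
"grow by just the amount needed inside the loop" for Latin-1 input, where
each byte expands to at most two bytes.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical: allocate the worst case (2n bytes plus a NUL) once. */
unsigned char *latin1_to_utf8_prealloc(const unsigned char *src, size_t n,
                                       size_t *out_len)
{
    unsigned char *dst, *d;
    size_t i;

    assert(src != NULL && out_len != NULL);
    dst = malloc(2 * n + 1);
    if (dst == NULL)
        return NULL;
    d = dst;
    for (i = 0; i < n; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            *d++ = c;                                  /* ASCII: one byte */
        } else {
            *d++ = (unsigned char)(0xC0 | (c >> 6));   /* lead byte */
            *d++ = (unsigned char)(0x80 | (c & 0x3F)); /* continuation */
        }
    }
    *d = '\0';
    *out_len = (size_t)(d - dst);
    return dst;
}

/* Hypothetical: start with n+1 bytes and grow only when a character
 * turns out to need more room than is left. */
unsigned char *latin1_to_utf8_grow(const unsigned char *src, size_t n,
                                   size_t *out_len)
{
    size_t cap = n + 1, len = 0, i;
    unsigned char *dst;

    assert(src != NULL && out_len != NULL);
    dst = malloc(cap);
    if (dst == NULL)
        return NULL;
    for (i = 0; i < n; i++) {
        unsigned char c = src[i];
        size_t need = (c < 0x80) ? 1 : 2;
        /* Room for this character, at least one byte for each
         * remaining input byte, and the trailing NUL. */
        size_t wanted = len + need + (n - i - 1) + 1;
        if (wanted > cap) {
            unsigned char *tmp = realloc(dst, wanted);
            if (tmp == NULL) {
                free(dst);
                return NULL;
            }
            dst = tmp;
            cap = wanted;
        }
        if (c < 0x80) {
            dst[len++] = c;
        } else {
            dst[len++] = (unsigned char)(0xC0 | (c >> 6));
            dst[len++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    dst[len] = '\0';
    *out_len = len;
    return dst;
}
```

The first version does a single allocation but may waste up to n bytes;
the second stays tight but pays for a comparison on every character and
possibly several reallocations.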

The ucfirst() and lcfirst() functions are implemented as one function
which branches at the crucial moment to do the upper or lower casing and
then comes back together.  Comments in the code ask whether the same
thing should be done for lc() and uc().  There are now several
differences between the two, but the vast majority of the code in these
routines is identical.  Should I combine them or leave them alone?
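
As a rough illustration of the "one function that branches at the
crucial moment" shape, here is a minimal ASCII-only sketch.  The enum and
function names are hypothetical and are not taken from pp.c; the real
routines also have to handle utf8 and locale, which is where the
differences between lc() and uc() come from.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical selector for the direction of the case change. */
enum case_op { CASE_UPPER, CASE_LOWER };

/* Shared loop; only the per-character step differs between the two. */
void change_case_ascii(char *s, size_t n, enum case_op op)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; i < n; i++) {
        char c = s[i];
        if (op == CASE_UPPER) {
            if (c >= 'a' && c <= 'z')
                c = (char)(c - 'a' + 'A');
        } else {
            if (c >= 'A' && c <= 'Z')
                c = (char)(c - 'A' + 'a');
        }
        s[i] = c;
    }
}
```

The trade-off is the usual one: a single copy of the common code to
maintain, against a branch inside the loop and a body that gets harder
to read as the two operations diverge.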

Finally, it would be trivial to change ucfirst() and lcfirst() so that,
when handed a utf8 string whose first character (the only one being
operated on) is in the strict ASCII range, they look up its case change
in a compiled-in table instead of going out to the filesystem, as they
must for the general case.  The extra expense when this isn't true is
one extra comparison; when it is true, the savings are considerable.
Shall I make this change?  An extension would be to do this for
characters in the 128-255 range as well, but that would need more
extensive code changes and extra tests, so I don't think it is worth
doing.
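
A minimal sketch of the proposed fast path, under stated assumptions:
the function names and the callback type are hypothetical, the slow path
is a stub standing in for the general (filesystem-backed) lookup, and
the ASCII mapping is done arithmetically here where the real code would
index a compiled-in table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical slow-path signature: writes the case-changed first
 * character of s to out and returns the number of bytes written. */
typedef size_t (*slow_case_fn)(const unsigned char *s, unsigned char *out);

/* Counts how often the stub below is reached (for illustration only). */
int slow_path_calls = 0;

/* Stub for the general lookup: copies the first utf8 character
 * unchanged, i.e. the lead byte plus any continuation bytes. */
size_t stub_slow_path(const unsigned char *s, unsigned char *out)
{
    size_t n = 0;

    slow_path_calls++;
    out[n] = s[n];
    n++;
    while ((s[n] & 0xC0) == 0x80) {
        out[n] = s[n];
        n++;
    }
    return n;
}

/* Stand-in for indexing a compiled-in ASCII case table. */
unsigned char ascii_upper(unsigned char c)
{
    return (c >= 'a' && c <= 'z') ? (unsigned char)(c - 'a' + 'A') : c;
}

/* Uppercases the first character of a utf8 string into out.  One extra
 * comparison decides whether the cheap ASCII path can be used. */
size_t ucfirst_first_char(const unsigned char *s, unsigned char *out,
                          slow_case_fn slow)
{
    assert(s != NULL && out != NULL && slow != NULL);
    if (*s < 0x80) {            /* strict ASCII: no external lookup */
        out[0] = ascii_upper(*s);
        return 1;
    }
    return slow(s, out);        /* general case */
}
```

With this shape, a string such as "perl" never reaches the slow path,
while one starting with a non-ASCII character pays only the single
comparison before falling through to it.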
