From: Glenn Linderman
Date: November 22, 2008 16:27
Subject: Re: [perl #58182] Unicode bug: More questions about coding
Message ID: 4928A357.9060402@NevCal.com
On approximately 11/22/2008 2:54 PM, came the following characters from
the keyboard of karl williamson:
> Glenn Linderman wrote:
>> On approximately 11/18/2008 5:55 PM, came the following characters
>> from the keyboard of karl williamson:
>>> I'm almost ready to submit my proposed changes for the uc(),
>>> lcfirst(), etc. functions for code review. But I have several more
>>> questions.
>> I have no answers, only comments to be sure that you have considered
>> things, and some opinions.
>>> These functions are all in pp.c.
>>> Currently, if in a "use bytes" scope these functions treat the data
>>> as strict ASCII, and change the case accordingly. Someone earlier
>>> suggested that this is a bug, that this mode is really for binary
>>> data only, and that the case should not change in this mode. What
>>> should I do?
>> I think the two options are:
>> 1) make these functions noops in "use bytes" mode, bringing them in
>> line with the "use bytes" documentation.
>> 2) case convert only ASCII letters [A-Za-z] (no code change) and
>> change the "use bytes" documentation to admit that these functions do
>> affect ASCII characters even in "use bytes" mode.
>> The second case is more compatible with today's actual (vs.
>> documented) behavior. Noops can also be obtained in other ways, so
>> allowing them to continue to operate on the ASCII characters gives
>> more options.
> In thinking about this some more, it seems like if we wanted to make
> things noops in bytes mode, the place to do it would be the parser (or
> whatever it is that sets up the execution stack) so that functions that
> are noops aren't even called.
Maybe so. Would be a more efficient noop that way. But what I meant
is that if the user wants a noop, they wouldn't generally code it as
$foo = uc( $foo )
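Just to make the two options concrete, here is a minimal sketch in C of
what each amounts to for a single byte (a hypothetical helper, not the
actual pp.c code, and it assumes an ASCII platform):

    /* Hypothetical helper illustrating the two options above.
     * Option 1: in "use bytes" mode, return the byte unchanged (a noop).
     * Option 2: in "use bytes" mode, case-shift only the ASCII letters
     *           a-z and leave every other byte alone.
     */
    unsigned char
    bytes_mode_toupper(unsigned char c, int treat_as_noop)
    {
        if (treat_as_noop)
            return c;                       /* option 1: binary data, never touched */
        if (c >= 'a' && c <= 'z')
            return (unsigned char)(c - 32); /* option 2: ASCII-only case change */
        return c;                           /* bytes 128-255 are left alone either way */
    }

Either way the bytes 128-255 stay untouched; the only difference is
whether a-z still map to A-Z.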
>>> There are a couple cases where a string has to be converted to utf8.
>>> bytes_to_utf8() assumes the worst case that the new string will
>>> occupy 2n+1 bytes, and allocates a new scalar with that size. The
>>> code in these functions checks every time through the loop that
>>> processes characters to see if more space is needed, and if so grows
>>> the scalar by just that amount. (This happens only in Unicode, where
>>> the worst case may be more than 2n.) Which precedent would it be
>>> preferable for me to follow when the worst case is 2n?
>> Hmm. So starting with bytes, the worst case is still 2n, no? But
>> that assumes that (1) you need to convert from bytes to UTF-8 and (2)
>> that the number of characters in the 128-255 range is significant.
>> I'm not sure what you already know at the decision point... I would
>> guess that you know only your position in the string, when you first
>> realize that it must be lengthened. Since even in Latin-1, most of
>> the case shifting stays within Latin-1, it seems unlikely that many
>> characters will grow. But if you are forced to convert to UTF-8, then
>> ... if you kept track of the number of characters seen so far with the
>> high bit set, then at the decision point you would know to allocate 1
>> byte for each ASCII character seen so far, 2 bytes for each byte seen
>> so far with the high bit set, and 2 bytes for each character not yet
>> processed. That could result in a memory savings vs. using twice the
>> total length, yet would avoid repetitive reallocations.
>> But starting with Unicode, I don't know if there is a rule, but are
>> the uppercase and lowercase characters far enough apart that they
>> change the size of their representation? If there are such, they would
>> seem to be few, because most of the uppercase and lowercase characters
>> of each type are grouped together in the same Unicode block. So the
>> current algorithm sounds appropriate for this.
> The worst case in Unicode is 3:1. But I decided to choose the worst
> case, as that is what is done when a string is upgraded to utf8.
I'm assuming you are saying here that the worst case for a lowercase
character converted to uppercase, or an uppercase character converted to
lowercase, can be 3:1 (since these are the operations of concern),
rather than that the worst-case conversion of one character byte to one
UTF-8 sequence is 3:1 (since I don't think that happens).
It is a space vs time tradeoff... and the results are highly dependent
on the data being manipulated...
So when you allocate 3:1 space and don't use it all, do you give it
back, or leave it dangling for the next potential operation?
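To make the bookkeeping I suggested above concrete (counting the
high-bit bytes already seen at the point where you discover the result
must become UTF-8), here is a rough sketch in C, with hypothetical names
rather than the actual pp.c code, assuming the input is a byte (Latin-1)
string:

    #include <stddef.h>

    /* We have already scanned `pos` of `len` input bytes and counted how
     * many of them were ASCII (0-127) and how many had the high bit set
     * (128-255).  In UTF-8 the ASCII bytes stay 1 byte and the high-bit
     * bytes become 2; the unscanned remainder is assumed to be the 2-byte
     * worst case.  This can come in under the blanket 2n+1.
     */
    size_t
    utf8_size_estimate(size_t len, size_t pos,
                       size_t ascii_seen, size_t highbit_seen)
    {
        size_t remaining = len - pos;
        return ascii_seen * 1          /* scanned ASCII bytes         */
             + highbit_seen * 2        /* scanned high-bit bytes      */
             + remaining * 2           /* unscanned bytes, worst case */
             + 1;                      /* room for a trailing NUL     */
    }

Whether that saving is worth carrying the extra counters around is,
again, highly dependent on the data.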
>>> The ucfirst() and lcfirst() functions are implemented in one function
>>> which branches at the crucial moment to do the upper or lower case
>>> and then comes back together. Comments in the code ask if the same
>>> thing should happen for lc() and uc(). There are now several
>>> differences between the two, but the vast majority of these routines
>>> is identical. Should I do the combining or let it alone?
>> You are coding it... you decide. It can be smoked on blead, and
>> merged back to 5.10.x only once it is reasonably certain to be correct.
> I still am feeling my way about the change culture here. I've worked on
> projects where only a major bug warranted a change--customers had to
> live with the ones that management didn't deem major enough. And I've
> worked on projects where the developer was God, and could do whatever
> they liked. I prefer ones where there is a discussion and consensus as
> to what should happen.
My comment was similar to others I've seen here. I'm by no means an
insider, although I've been hanging around quite a while. You are
looking at the code and doing the work; as long as you have a reasonable
justification (like the comment you found) for the change, I think it
will fly. Gratuitous changes don't seem to be particularly welcome, but
if a change makes the code more correct, easier to maintain, or shorter,
without being measurably slower, it seems to be accepted.
>>> Finally, it would be trivial to change ucfirst() and lcfirst() so
>>> that if handed a utf8 string in which the first character (the only
>>> one being operated on) is in the strict ASCII range, they look up
>>> its case change in a compiled-in table instead of going out to the
>>> filesystem, as they must do for the general case. The
>>> extra expense when this isn't true is an extra comparison, but if it
>>> is true, there is quite a bit of savings. Shall I make this change?
>>> An extension could be to even do this on characters in the 128-255
>>> range, but there would need to be more extensive code changes, and
>>> extra tests, so I don't think that this is worth doing.
>> Sounds like a fair win, overall. There are a significant number of
>> words in Latin-based languages that start with unaccented (ASCII)
>> first letters.
>> Regarding the 128-255 range, it would be possible (I think) to make a
>> case shifting table indexed by ord, containing either (1) the case
>> shifted character, or (2) a sentinel such as 0x0 or 0xff or 0x94 to
>> flag that the table doesn't work for this character. Or the table
>> entries could be two bytes wide, so a truly out-of-range value such as
>> 0xffff could be used. (The one-byte sentinels are thought to be
>> unlikely to occur in true character strings that get passed to these
>> functions, so a performance hit when one is encountered wouldn't be
>> onerous.)
> That's similar to what I did.
>> Regarding the 0-128 range, it is possible to do the OR to lowercase
>> and AND to uppercase, if you check the range to be letters. Not clear
>> that such is faster, only smaller. The table would be more general,
>> and faster, and work for some of the non-ASCII.
> I don't understand what you're saying here, but I use the existing code
> to handle characters in this range, which comes down to testing if 'A'
> <= c <= 'Z' on ASCII machines and then adding 32 to get the lower case;
> similar for lower case going the other way. A table lookup is about the
> same amount of machine work.
Adding 32 for numbers in that range is the same as ORing with 32;
subtracting 32 for numbers in that range is the same as ANDing with ~32.
Flipping the bit via logical operations, versus doing arithmetic. 6 of
one, a half-dozen of the other. Logic operations used to be faster, way
back when, because there was no possibility of a carry; with today's
processors, it is generally one clock for either.
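In C terms, for the letters themselves the two forms produce literally
the same value (a throwaway check, not anything from pp.c):

    #include <assert.h>

    /* For ASCII letters, 32 (0x20) is exactly the case bit, so the
     * arithmetic and the logical forms agree.  The range check has to
     * come first; these identities hold only for A-Z and a-z.
     */
    int main(void)
    {
        unsigned char c;
        for (c = 'A'; c <= 'Z'; c++)
            assert((unsigned char)(c + 32) == (c | 0x20));              /* to lower */
        for (c = 'a'; c <= 'z'; c++)
            assert((unsigned char)(c - 32) == (unsigned char)(c & ~0x20)); /* to upper */
        return 0;
    }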
Now, though, you've got me not understanding something.
If the "ad hoc" logic to test for the a-z A-Z ranges is about the same
expense as a table, and you have now invented a table for the 128-255
range anyway, wouldn't it be simpler overall to use the table for the
a-z A-Z ranges also?
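To illustrate the kind of 0-255 table I have in mind (purely a sketch;
the sentinel and the hand-rolled initialization are made up here, and a
real table would be generated from the Unicode data):

    #define NO_SIMPLE_UC 0xFFFF  /* sentinel: no single-byte uppercase exists */

    /* A 256-entry table indexed by ord(ch).  Each entry is either the
     * single-byte uppercase of that character, or the sentinel meaning
     * "fall back to the general Unicode path".
     */
    static unsigned short uc_table[256];

    static void
    init_uc_table(void)
    {
        int i;
        for (i = 0; i < 256; i++)
            uc_table[i] = (unsigned short)i;        /* default: maps to itself */
        for (i = 'a'; i <= 'z'; i++)
            uc_table[i] = (unsigned short)(i - 32); /* ASCII letters */
        for (i = 0xE0; i <= 0xFE; i++)
            uc_table[i] = (unsigned short)(i - 32); /* most Latin-1 lower case */
        uc_table[0xF7] = 0xF7;          /* division sign, not a letter */
        uc_table[0xB5] = NO_SIMPLE_UC;  /* micro sign uppercases to U+039C */
        uc_table[0xDF] = NO_SIMPLE_UC;  /* sharp s uppercases to "SS" */
        uc_table[0xFF] = NO_SIMPLE_UC;  /* y-with-diaeresis uppercases to U+0178 */
    }

One index plus one test against the sentinel then handles a-z, A-Z, and
nearly all of Latin-1, so the separate ad hoc range test could go away.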
>> The table for ucfirst & lcfirst could also be used for uc and lc, no?
>> You don't mention if that would be a win in those cases, or maybe they
>> already have a table? Granted it only helps people that operate in
>> the ASCII (or maybe Latin-1) range, but there are a lot of them.
> I ended up doing it for all the functions, and including the range
> 128-255 without going to the general Unicode functions. The expense for
> characters above 255 is 2 comparisons. The payoff for those less is
Glenn -- http://nevcal.com/
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking