develooper Front page | perl.perl5.porters | Postings from November 2008

Re: [perl #58182] Unicode bug: More questions about coding

Thread Previous | Thread Next
From:
karl williamson
Date:
November 22, 2008 18:34
Subject:
Re: [perl #58182] Unicode bug: More questions about coding
Message ID:
4928C12D.8060807@khwilliamson.com
Glenn Linderman wrote:
> On approximately 11/22/2008 2:54 PM, came the following characters from 
> the keyboard of karl williamson:
>> Glenn Linderman wrote:
>>> On approximately 11/18/2008 5:55 PM, came the following characters 
>>> from the keyboard of karl williamson:
>>>> I'm almost ready to submit my proposed changes for the uc(), 
>>>> lcfirst(), etc. functions for code review.  But I have several more 
>>>> questions.
>>>
>>>
>>> I have no answers, only comments to be sure that you have considered 
>>> things, and some opinions.
>>>
>>>
>>>> These functions are all in pp.c.
>>>>
>>>> Currently, if in a "use bytes" scope these functions treat the data 
>>>> as strict ASCII, and change the case accordingly.  Someone earlier 
>>>> suggested that this is a bug, that this mode is really for binary 
>>>> data only, and that the case should not change in this mode.  What 
>>>> should I do?
>>>
>>>
>>> I think the two options are:
>>>
>>> 1) make these functions noops in "use bytes" mode, bringing them in 
>>> line with the "use bytes" documentation.
>>>
>>> 2) case convert only ASCII letters [A-Za-z] (no code change) and 
>>> change the "use bytes" documentation to admit that these functions do 
>>> affect ASCII characters even in "use bytes" mode.
>>>
>>> The second case is more compatible with today's actual (vs. 
>>> documented) behavior.  Noops can also be obtained in other ways, so 
>>> allowing them to continue to operate on the ASCII charcaters gives 
>>> more options.
>>>
>>
>> In thinking about this some more, it seems like if we wanted to make 
>> things noops in bytes mode, the place to do it would be the parser (or 
>> whatever it is that sets up the execution stack) so that functions 
>> that are noops aren't even called.
> 
> 
> Maybe so.  Would be a more efficient noop that way.  But what I meant, 
> is that if the user wants a noop, they wouldn't generally code it as
> 
> use bytes;
> $foo = uc( $foo )
> no bytes;
> 
>
Sure.  What I meant was that the function isn't the place to be coding 
in noop cases, so I could have answered my own question if I had thought 
about it.  I should leave it do what it always has done in bytes mode, 
and if someone wants to change perl so that it acts as documented, the 
place to do so is not the function, but the parser.

>>>> There are a couple cases where a string has to be converted to utf8. 
>>>> bytes_to_utf8() assumes the worst case that the new string will 
>>>> occupy 2n+1 bytes, and allocates a new scalar with that size.  The 
>>>> code in these functions check every time through the processing 
>>>> characters loop to see if more space is needed, and if so grows the 
>>>> scalar by just that amount.  (This happens only in Unicode where the 
>>>> worst case may be more than 2n)  Which precedent would it be 
>>>> preferable for me to follow when the worst case is 2n?
>>>
>>>
>>> Hmm.  So starting with bytes, the worst case is still 2n, no?  But 
>>> that assumes that (1) you need to convert from bytes to UTF-8 and (2) 
>>> that the number of characters in the 128-255 range is significant.  
>>> I'm not sure what you already know at the decision point... I would 
>>> guess that you know only your position in the string, when you first 
>>> realize that it must be lengthened.  Since even in Latin-1, most of 
>>> the case shifting stays within Latin-1, it seems unlikely that many 
>>> characters will grow.  But if you are forced to convert to UTF-8, 
>>> then ... if you kept track of the number of characters seen so far 
>>> with the high bit set, then at the decision point you would know to 
>>> allocate 1 byte for each ASCII character seen so far, 2 bytes for 
>>> each byte seen so far with the high bit set, and 2 bytes for each 
>>> character not yet processed.  That could result in a memory savings 
>>> vs. using twice the total length, yet would avoid repetitive 
>>> reallocations.
>>>
>>> But starting with Unicode, I don't know if there is a rule, but are 
>>> the uppercase and lowercase characters far enough apart, that they 
>>> change the size of their representation?  If there are such, it would 
>>> seem to be few, because most of the uppercase and lowercase 
>>> characters of each type are grouped together in the same Unicode 
>>> block.  So the current algorithm sounds appropriate for this.
>>>
>> The worst case in Unicode is 3:1.  But I decided to choose the worst 
>> case, as that is what is done when a string is upgraded to utf8.
> 
> 
> I'm assuming you are saying here that the worst case for a lowercase 
> character converted to uppercase, or an uppercase character converted to 
> lowercase, can be 3:1 (since these are the operations of concern), 
> rather than the worst case conversion of one character byte to one UTF-8 
> sequence is 3:1 (since I don't think that happens).
> 
> It is a space vs time tradeoff... and the results are highly dependent 
> on the data being manipulated...
> 
> So you allocate 3:1 space, if you don't use it, do you give it back, or 
> leave it dangle for the next potential operation?
>

The worst case for a single byte to utf8 is 2:1.  The worst case for in 
general changing the case of a utf8 character is 3:1, because of the 
extra modifiers that go with it.  The way the functions (not ones I have 
touched, by the way) work is that they essentially malloc enough space 
for the worst case for converting from byte to utf8.  Any extra dangles 
until the scalar's reference count goes to 0, when the entire amount is 
freed.  This extra may be needed if the variable is appended to, say. 
If the scalar has to grow beyond what is adjacent to the string (and I 
haven't really looked at this code, but am doing some surmising), a new 
string is allocated, the original's space is freed, and the scalar now 
has a different string pointer.


> 
>>>> The ucfirst() and lcfirst() functions are implemented in one 
>>>> function which branches at the crucial moment to do the upper or 
>>>> lower case and then comes back together.  Comments in the code ask 
>>>> if the same thing should happen for lc() and uc().  There are now 
>>>> several differences between the two, but the vast majority of these 
>>>> routines is identical. Should I do the combining or let it alone?
>>>
>>>
>>> You are coding it... you decide.  It can be smoked on blead, and 
>>> merged back to 5.10.x only once it is reasonably certain to be correct.
>>>
>>
>> I still am feeling my way about the change culture here.  I've worked 
>> on projects where only a major bug warranted a change--customers had 
>> to live with the ones that management didn't deem major enough.  And 
>> I've worked on projects where the developer was God, and could do 
>> whatever they liked.  I prefer ones where there is a discussion and 
>> consensus as to what should happen.
> 
> 
> My comment was similar to others I've seen here.  I'm by no means an 
> insider, although I've been hanging around quite a while.  You are 
> looking at the code and doing the work; as long as you have a reasonable 
> justification (like the comment you found) for the change, I think it 
> will fly.  Gratuitous changes don't seem to be particularly welcome, but 
> if it makes the code more correct, easier to maintain, shorter, not 
> measurably slower, things seem to be accepted.
> 
> 
>>>> Finally, it would be trivial to change ucfirst() and lcfirst() so 
>>>> that if handed a utf8 string in which the first character (the only 
>>>> one being operated on) is in the strict ascii range, then to look up 
>>>> its case change in a compiled-in table instead of going out to the 
>>>> filesystem to look it up, as it must do for the general case.  The 
>>>> extra expense when this isn't true is an extra comparison, but if it 
>>>> is true, there is quite a bit of savings.  Shall I make this 
>>>> change?  An extension could be to even do this on characters in the 
>>>> 128-255 range, but there would need to be more extensive code 
>>>> changes, and extra tests, so I don't think that this is worth doing.
>>>
>>>
>>> Sounds like a fair win, overall.  There are a significant number of 
>>> words in Latin-based languages that start with unaccented (ASCII) 
>>> first letters.
>>>
>>> Regarding the 128-255 range, it would be possible (I think) to make a 
>>> case shifting table indexed by ord, that contained (1) the case 
>>> shifted character (2) 0x0 or 0xff or 0x94 to flag that the table 
>>> doesn't work for this character.  Or the table entries could be two 
>>> bytes wide, so a truly out of range character could be used 0xffff. 
>>> (The one byte suggestions are thought to be unlikely to true 
>>> character strings that get passed to these functions, so a 
>>> performance hit when encountered wouldn't be onerous.)
>>
>> That's similar to what I did.
>>>
>>> Regarding the 0-128 range, it is possible to do the OR to lowercase 
>>> and AND to uppercase, if you check the range to be letters.  Not 
>>> clear that such is faster, only smaller.  The table would be more 
>>> general, and faster, and work for some of the non-ASCII.
>>>
>>
>> I don't understand what you're saying here,  but I use the existing 
>> code to handle characters in this range, which comes down to testing 
>> if 'A' <= c <= 'Z' on ASCII machines and then adding 32 to get the 
>> lower case; similar for lower case going the other way.  A table 
>> lookup is about the same amount of machine work.
> 
> 
> add 32 for numbers in that range is the same as OR 32; sub 32 for 
> numbers in that range is the same as AND ~ 32.  Flipping the bit via 
> logical operations, versus doing arithmetic.  6 of one, a half-dozen of 
> the other.  Logic operations used to be faster, way back when, because 
> there were no possibility of carries; with today's processors, it is 
> generally one clock for either.
> 
> Now, though, you've got me not understanding something.
> 
> If there was "ad hoc" logic to test for a-z A-Z ranges, and it is about 
> the same expense as a table, but that now you've invented a table for 
> the 128-255 range, wouldn't it be simpler overall to use the table for 
> the a-z A-Z ranges also?
> 
> 
I'm sorry I wasn't very clear.  Restated, in compatibility mode the 
existing macros are used which have the non-ascii chars be caseless, but 
in the new mode, the table is used for the entire 0-255 range.

>>> The table for ucfirst & lcfirst, could also be used for uc and lc, 
>>> no? You don't mention if that would be a win in those cases, or maybe 
>>> they already have a table?  Granted it only helps people that operate 
>>> in the ASCII (or maybe Latin-1) range, but there are a lot of them.
>>
>> I ended up doing it for all the functions, and including the range 
>> 128-255 without going to the general Unicode functions.  The expense 
>> for characters above 255 is 2 comparisons.  The payoff for those less 
>> is significant.
> 
> 


Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About