perl.perl5.porters | Postings from April 2007

Re: perl, the data, and the tf8 flag

From: Glenn Linderman
Date: April 1, 2007 16:35
Subject: Re: perl, the data, and the tf8 flag
Message ID: 461041A0.1090405@NevCal.com

On approximately 4/1/2007 3:26 PM, came the following characters from 
the keyboard of demerphq:
> On 3/31/07, Glenn Linderman <perl@nevcal.com> wrote:

>> B) unpack -- it is not clear to me how this operation could successfully
>> process a multi-bytes buffer parameter, except by first downgrading it,
>> if it contains no values > 255, since all the operations on it are
>> defined in terms of unpacking bytes.
> 
> This is sort of what happens with pack *in blead*. Except that instead
> of downgrading, it treats codepoints as being bytes by doing mod 256 on
> their values.


Aha!  OK, this is a way that unpack could successfully operate on a 
multi-bytes buffer.  But I think it is also equivalent to downgrading it 
(with a warning for values > 255) and then processing it as bytes.  I am 
not sure which one is more efficient; I would guess the first.  But if the 
buffer is embedded in a potentially longer string, the number of bytes 
to downgrade is not known until the unpack operation is complete, and 
downgrading the stuff beyond what is needed for unpack could result in the 
warning being issued unnecessarily....
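
The two approaches can be compared with a small sketch. It does not show what unpack does internally; it only emulates "mod 256 per code point" and "downgrade first" by hand, assuming a perl (5.10 or later) in which unpack operates on characters. The sample buffer and all variable names are made up for illustration.

```perl
use strict;
use warnings;

# A 4-octet network-order integer followed by one wide character (U+263A).
# The concatenation upgrades the whole string internally.
my $buf = pack('N', 12345678) . chr(0x263A);

# Approach 1: reduce every code point modulo 256, then unpack.
my $reduced = join '', map { chr(ord($_) % 256) } split //, $buf;
my $via_mod = unpack 'N', $reduced;

# Approach 2a: downgrade the whole buffer. The true second argument asks
# utf8::downgrade to return false instead of dying on a wide character.
my $whole    = $buf;
my $whole_ok = utf8::downgrade($whole, 1);

# Approach 2b: downgrade only the four characters that 'N' consumes.
my $prefix    = substr $buf, 0, 4;
my $prefix_ok = utf8::downgrade($prefix, 1);
my $via_down  = unpack 'N', $prefix;

printf "mod 256 value: %d\n", $via_mod;
printf "whole-buffer downgrade ok: %s\n", $whole_ok ? 'yes' : 'no';
printf "prefix downgrade ok: %s, value: %d\n",
    $prefix_ok ? 'yes' : 'no', $via_down;
```

The whole-buffer downgrade trips over the trailing wide character even though 'N' never looks at it, which is the unnecessary-warning case described above.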


> This makes a certain amount of sense if you assume that
> strings can (apparently) randomly change from octet encoding to utf8
> encoding. For instance:
> 
>  my $s=pack 'N',12345678;
>  $s.=chr(256); # upgrade $s to utf8 by catting on a unicode codepoint
>  chop $s;        # lose the catted codepoint, encoding remains utf8
>  print unpack 'N',$s; # prints 12345678
> 
> So 'N' works with codepoints, not with bytes. Apparently this holds
> true for most of the pack template formats. HOWEVER, it doesn't apply
> to the pattern 'C' (and if I understand his recent posts, this is what
> Marc was objecting to recently), which reads bytes.


OK, this is the first time I fully understand what Marc was complaining 
about.  Thank you.
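
The silent change of internal encoding in the quoted example can be made visible with utf8::is_utf8. This is a minimal sketch, assuming a perl (5.10 or later) where unpack works on characters; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $s = pack 'N', 12345678;
my $flag_before = utf8::is_utf8($s) ? 1 : 0;

$s .= chr(256);   # appending a wide character forces the internal upgrade
chop $s;          # removes the wide character, but not the upgrade
my $flag_after = utf8::is_utf8($s) ? 1 : 0;

my $len   = length $s;        # counted in characters, not stored octets
my $value = unpack 'N', $s;   # unpack sees characters here

print "flag before=$flag_before after=$flag_after ",
      "length=$len value=$value\n";
```

The string still has four characters and still unpacks to the original number, even though its internal representation has changed.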

I think that pack-U should be defined to produce "encoded bytes" not 
"multi-bytes" and that the result buffer should not be upgraded... if 
someone wants to use pack-U to create a UTF string from a set of 
codepoint values, instead of \x{..}\x{...} or join('',chr(...),chr(...)) 
then they can just as well use

   decode_utf8(pack("U*",...,...));
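
To make the "encoded bytes" versus "multi-bytes" distinction concrete, here is a hedged sketch using the Encode module. Note the assumption: in current perls pack-U already yields a character string, so encode_utf8 stands in for the proposed pack-U that would produce encoded bytes. The sample code points are arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my @code_points = (0x263A, 0xE9);   # arbitrary sample values

# What pack-U yields in current perls: a string of characters.
my $chars = pack 'U*', @code_points;

# Stand-in for the proposed behaviour: the same data as UTF-8 encoded
# bytes in a plain byte buffer.
my $bytes = encode_utf8($chars);
my ($n_chars, $n_bytes) = (length $chars, length $bytes);
my $bytes_flagged = utf8::is_utf8($bytes) ? 1 : 0;

# Under the proposal, callers wanting characters would decode explicitly.
my $back = decode_utf8($bytes);

print "characters=$n_chars encoded bytes=$n_bytes ",
      "round trip ok=", ($back eq $chars ? 'yes' : 'no'), "\n";
```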

Allowing the %256 style of unpack operation described above when unpack 
is handed a multi-bytes buffer also seems reasonable... but U should also 
operate by choosing one %256 byte from each value, and processing 
sufficient values to compose a complete decoded value.


> Which to me says that almost any use of 'C' as an unpack template in
> Perl 5.9.x and later will be totally wrong.  


I agree.  Thanks for a clear explanation.


> My feeling is that Marc's
> suggestion about making 'C' an alias for 'U' and introducing a new
> template char for what 'C' does currently (O for octet maybe) is the
> right thing to do, with a warning when modding the result of 'U' results
> in a number larger than 255.


Well, if you define pack-U and unpack-U like I just did, then unpack-C 
would grab a value, warn if >255, and then apply %256.  This would be 
different from unpack-U, which would expect the first value of the 
sequence to be the value of a utf8 start byte, which would indicate how 
many values would then be extracted.
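
The proposed semantics can be sketched in plain Perl. Everything here is hypothetical: proposed_unpack_C, proposed_unpack_U and utf8_sequence_length are illustrative names, not existing functions, and the sketch does not reject overlong or out-of-range sequences.

```perl
use strict;
use warnings;

# Hypothetical unpack-C: take one value, warn above 255, keep the low octet.
sub proposed_unpack_C {
    my ($str, $pos) = @_;
    my $value = ord substr($str, $pos, 1);
    warn sprintf("value 0x%X exceeds 255\n", $value) if $value > 255;
    return $value % 256;
}

# How many values a UTF-8 sequence occupies, judged from its start byte.
# Returns 0 for a byte that cannot start a sequence.
sub utf8_sequence_length {
    my ($start) = @_;
    return 1 if $start < 0x80;
    return 2 if ($start & 0xE0) == 0xC0;
    return 3 if ($start & 0xF0) == 0xE0;
    return 4 if ($start & 0xF8) == 0xF0;
    return 0;
}

# Hypothetical unpack-U: read the start byte, then consume as many values
# (each reduced %256) as it announces. Returns (code point, values used),
# or an empty list on malformed input.
sub proposed_unpack_U {
    my ($str, $pos) = @_;
    my $start = ord(substr($str, $pos, 1)) % 256;
    my $len = utf8_sequence_length($start) or return;
    return if $pos + $len > length $str;
    my $cp = $len == 1 ? $start : $start & (0xFF >> ($len + 1));
    for my $i (1 .. $len - 1) {
        my $cont = ord(substr($str, $pos + $i, 1)) % 256;
        return unless ($cont & 0xC0) == 0x80;   # must be a continuation byte
        $cp = ($cp << 6) | ($cont & 0x3F);
    }
    return ($cp, $len);
}

my $encoded = "\xE2\x98\xBA";                    # UTF-8 octets for U+263A
my ($cp, $used) = proposed_unpack_U($encoded, 0);
my $octet = proposed_unpack_C("\x{141}", 0);     # warns, keeps the low octet
printf "U: code point 0x%X from %d values; C: 0x%02X\n", $cp, $used, $octet;
```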

Maybe it is too late to define pack-U as producing encoded data in a 
bytes buffer, rather than causing the buffer to be a multi-bytes 
buffer... or maybe it can be considered a bug... but I think it is the 
way to go, to the extent that I'd recommend deprecating U if it cannot 
be fixed, and adding M (multi-byte variable-length encoded 32-bit value) 
to the pack/unpack repertoire.



>> Firstly, regular expressions deal in "characters", not bytes, or
>> multi-byte sequences.
> 
> I don't know where this meme comes from. It's just not true.


The joke's on me.  I said that, but I was busy pointing out that bytes 
vs multi-bytes had nothing to do with character set semantics, or text 
vs binary semantics.  Of course, there is a small relationship... the 
multi-bytes form happens to be a space-efficient storage encoding for 
typical latin-language-centric Unicode data sets... and thus worth the 
effort to implement.

And in the message I was writing when this one arrived, I effectively 
agreed with you on this topic in my list of Perl operations that 
implement character set semantics.


> Case insensitivity is the other place where you will see differences.
> The languages that most people on this list speak as a mother tongue
> have "uppercase" and "lowercase". Well it turns out that there are
> languages that have an additional case (titlecase) and that the
> commonly understood rules for doing a case insensitive match won't
> work. For instance a naive assumption would be that to do a case
> insensitive match you would either uppercase or lowercase all of the
> characters in both strings and then proceed from there. Well, this
> won't work with Greek, say, and in fact it won't work with German either.


I don't know much about titlecase.  Thanks for the term.  I haven't read 
all the Unicode specs (in case it is there), just what I thought I 
needed to know, so far.  There's a lot to read, and only so much time.


> So to do case insensitive matching in unicode you need to do
> "foldcase" matching, which is that you convert the sequence into a
> normalized folded version and then compare that. Where this gets
> tricky is that in some languages, German for example, the folded
> version of a particular letter is in fact more than one letter. So the
> foldcase of GERMAN-SHARP-ESS aka \x{DF} aka ß is 'ss'. The uppercase
> of the letter is ß, and unsurprisingly so is the lowercase.


The surprising part to me is that you say (and you are probably right) 
that ß uppercase is ß... an Austrian fellow once told me it was SS ... 
but then he was a programmer, not a German major!  But I understand the 
issue you raise about one letter vs two when altering case.
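
Both statements can be reconciled: Unicode gives U+00DF no simple (one-to-one) uppercase mapping, so it maps to itself there, while its full uppercase mapping is SS. A small sketch of what Perl's own case functions do, assuming perl 5.16 or later for fc and a string that has been upgraded so that Unicode rules apply:

```perl
use strict;
use warnings;
use feature 'fc';   # fc() needs perl 5.16 or later

my $sharp_s = "\x{DF}";
utf8::upgrade($sharp_s);   # make sure Unicode rules apply

my $upper  = uc $sharp_s;        # full uppercase mapping
my $title  = ucfirst $sharp_s;   # titlecase mapping
my $lower  = lc $sharp_s;        # already lowercase
my $folded = fc $sharp_s;        # case folding, as used for matching

printf "uc=%s ucfirst=%s fc=%s lc unchanged=%s\n",
    $upper, $title, $folded, ($lower eq $sharp_s ? 'yes' : 'no');
```

One letter becomes two in all three of the uc, ucfirst and fc results, which is the length-changing issue under discussion.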



> Now where this gets really annoying is that \x{DF} is the ONLY letter
> in unicode that is in latin_1 that has a multibyte foldcase
> representation, yet at the same time Perl has never considered \x{DF}
> to match 'ss' in latin_1.
> 
> So if you have a string that contains \x{DF} you'll find it will match
> 'ss' case insensitively if the string is in unicode, but not if it's in
> latin_1.
> 
> Anyway, hope this clarifies things a bit.


Yes, thanks!  ß does sound annoying for latin_1 support.
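
The inconsistency can be reproduced with two copies of the same one-character string that differ only in internal encoding. This is a sketch under the default regex rules, meaning no `use feature 'unicode_strings'`, no `use locale` and no /u modifier; the variable names are illustrative.

```perl
use strict;
use warnings;

my $native   = "\x{DF}";    # stored internally as a single octet
my $upgraded = "\x{DF}";
utf8::upgrade($upgraded);   # same character, stored as UTF-8 internally

my $same_string    = $native eq $upgraded ? 1 : 0;
my $native_match   = $native   =~ /\Ass\z/i ? 1 : 0;
my $upgraded_match = $upgraded =~ /\Ass\z/i ? 1 : 0;

print "equal as strings: $same_string\n";
print "native matches /ss/i: $native_match\n";
print "upgraded matches /ss/i: $upgraded_match\n";
```

As far as I know, later perls (5.14 onwards) let the /u modifier or the unicode_strings feature force Unicode rules for both cases, which removes the difference.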


> 
> Cheers,
> Yves
> 
> 


-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
