
Re: perl, the data, and the tf8 flag

From: Glenn Linderman
Date: April 18, 2007 11:48
Subject: Re: perl, the data, and the tf8 flag
Message ID: 462667E4.80403@NevCal.com

Well, the peanut gallery didn't add to this list, but other discussion
on p5p mentioned another pair of operations that have ASCII vs
Unicode semantics depending on the type of the buffer, bytes or
multi-bytes:

uc
lc

So below I repeat this list to make it more complete, including also 
Juerd's point about \L, et alia, on non-constants.

Off-list discussion has also produced the following related facts, which
may surprise non-implementers:

Values greater than 2^32 can be represented in multi-bytes strings.

Multi-bytes string values are encoded in a variable-length form:

*) Values less than 2^36 are represented as a variable length
byte-stream varying from 1-7 bytes
*) Values of 2^36 or greater are represented as a 13 byte stream starting
with 0xff, and having 12 following bytes with the high-order bit on, the
next bit off, and 6 data bits following (this is the same format used
for all but the first byte of the variable length data, also)
*) Values up to 2^72-1 seem possible to represent using this notation.  
12 bytes would seem to have sufficed to cover values through 2^64-1, or 
even 2^66-1, however, so the 13 bytes seems to have been an unusual choice.
*) Values are further limited to the values represented in 
platform-sized integers, however, which varies between platforms.
*) The platform-size integer limit seems to apply even to the "\x{}" 
notation.
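
A rough sketch of the length rules listed above, for anyone who wants to
poke at them. The helper names extended_utf8_length and actual_length are
made up for this illustration, and the thresholds assume an ASCII (not
EBCDIC) platform. The second helper asks perl itself, by encoding the
character and counting bytes, so the two columns can be compared. Values
of 2^31 and up are only tried when the perl has 64-bit integers.

```perl
use strict;
use warnings;
use Config;

# Hypothetical helper: predicted byte count for a code point, using the
# thresholds described in this message (assumes an ASCII platform).
sub extended_utf8_length {
    my ($cp) = @_;
    return 1 if $cp < 2**7;
    return 2 if $cp < 2**11;
    return 3 if $cp < 2**16;
    return 4 if $cp < 2**21;
    return 5 if $cp < 2**26;
    return 6 if $cp < 2**31;
    return 7 if $cp < 2**36;   # lead byte plus 6 continuation bytes
    return 13;                 # 0xff plus 12 continuation bytes
}

# Hypothetical helper: byte count perl actually produces.
sub actual_length {
    my ($cp) = @_;
    no warnings;               # code points beyond Unicode would warn
    my $s = chr($cp);
    utf8::encode($s);          # replace characters with their encoded bytes
    return length $s;
}

my @samples = (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x7FFFFFFF);
# Larger values only fit where perl has 64-bit integers.
push @samples, 2**31, 2**36 - 1, 2**36 if $Config{ivsize} >= 8;

for my $cp (@samples) {
    printf "0x%X: predicted %d, actual %d\n",
        $cp, extended_utf8_length($cp), actual_length($cp);
}
```

If the two columns ever disagree on some platform, trust perl rather than
the table; the table is only my reading of the rules above.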


On approximately 4/1/2007 4:05 PM, came the following characters from 
the keyboard of Glenn Linderman:
> ASCII semantics are used for:
>
> \L, \l, \U, \u, \Q operations in bytes string constants (no character 
> code values > 255)

\L, \l, \U, \u, \Q operations in bytes string variables
uc, lc operations when passed bytes string parameters (no character code 
values > 255)

> regexp suboperations when used with byte string parameter:
>   case-insensitivity
>   character classes \w \W \s \S \b \B
>   modifiers /i
>
> Unicode semantics are used for:
>
> \L, \l, \U, \u, \Q operations in multi-bytes string constants (at 
> least one character code value > 255)

\L, \l, \U, \u, \Q operations in multi-bytes string variables (no 
restrictions on character code values)
uc, lc operations when passed multi-bytes string parameters (no 
restrictions on character code values)

> regexp suboperations when used with multi-bytes string parameter:
>   case-insensitivity
>   match codes \w \W \s \S \b \B \Z
>   modifiers /i /m
>
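
To make the two lists above concrete, here is a small sketch using one
character, 0xE9 (e-acute), held first as a bytes string and then as a
multi-bytes string. The variable names are just for illustration. It
assumes none of "use locale", "use feature 'unicode_strings'", or a
"use v5.12"-or-later line is in effect, since any of those changes the
rules being demonstrated.

```perl
use strict;
use warnings;
# Deliberately no "use locale", no "use feature 'unicode_strings'", and
# no "use v5.12" or later: each of those alters the behaviour shown here.

my $bytes = "\xE9";        # e-acute held as a bytes string
my $multi = $bytes;
utf8::upgrade($multi);     # same character, multi-bytes representation

# uc: ASCII semantics leave 0xE9 alone; Unicode semantics map it to 0xC9.
printf "uc bytes: 0x%02X\n", ord(uc($bytes));
printf "uc multi: 0x%02X\n", ord(uc($multi));

# \U on an interpolated variable follows the same rule as uc.
printf "\\U bytes: 0x%02X\n", ord("\U$bytes");
printf "\\U multi: 0x%02X\n", ord("\U$multi");

# \w in a regexp: not a word character as bytes, a word character as
# multi-bytes.
printf "\\w bytes: %s\n", ($bytes =~ /\w/ ? "match" : "no match");
printf "\\w multi: %s\n", ($multi =~ /\w/ ? "match" : "no match");
```

The two strings compare equal with eq, yet the operations treat them
differently, which is the whole point of the lists.
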
> encode.pm and file handles with encoding layers explicitly define the 
> character semantics they use, which include ASCII, Unicode, and many 
> other encodings.
>
> Note: it appears to me that Perl (except for encode.pm) _never_ 
> applies Latin-1 semantics to anything, at present.  But people talk 
> about it, because if bytes strings are converted to multi-bytes 
> strings the result is the same as converting Latin-1 character codes 
> to Unicode character codes.
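
The last point in that note can be checked with a short sketch: upgrade
a bytes string holding every value 0..255 and confirm that the character
codes are unchanged, which is the same mapping as Latin-1 to Unicode
(Latin-1 code N is Unicode code point N). The variable names are only
illustrative.

```perl
use strict;
use warnings;

# Every byte value 0..255, as a bytes string.
my $bytes = join '', map { chr($_) } 0 .. 255;
my $multi = $bytes;
utf8::upgrade($multi);     # convert to the multi-bytes representation

# The internal representation differs ...
print "bytes flagged: ", (utf8::is_utf8($bytes) ? "yes" : "no"), "\n";
print "multi flagged: ", (utf8::is_utf8($multi) ? "yes" : "no"), "\n";

# ... but every character code is the same before and after.
my @before = map { ord($_) } split //, $bytes;
my @after  = map { ord($_) } split //, $multi;
print "codes preserved: ", ("@before" eq "@after" ? "yes" : "no"), "\n";
print "strings equal:   ", ($bytes eq $multi ? "yes" : "no"), "\n";
```
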
-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

