Well, the peanut gallery didn't add to this list, but other discussion on p5p mentioned another pair of operations whose ASCII vs Unicode semantics depend on the type of the buffer (bytes or multi-bytes):

    uc lc

So below I repeat this list to make it more complete, including also Juerd's point about \L, et alia, on non-constants.

Off-list discussion has also produced the following related facts, perhaps surprising to non-implementers: values greater than 2^32 can be represented in multi-bytes strings. Multi-bytes string values are encoded in a varying-length form...

*) Values less than 2^36 are represented as a variable-length byte stream of 1 to 7 bytes.

*) Values of 2^36 or greater are represented as a 13-byte stream starting with 0xff, followed by 12 bytes that each have the high-order bit on, the next bit off, and 6 data bits (this is the same format used for all but the first byte of the variable-length data, also).

*) Values up to 2^72-1 seem possible to represent using this notation. 12 bytes would seem to have sufficed to cover values through 2^64-1, or even 2^66-1, however, so 13 bytes seems to have been an unusual choice.

*) Values are further limited to those representable in platform-sized integers, however, and that size varies between platforms.

*) The platform-sized integer limit seems to apply even to the "\x{}" notation.
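To make the byte counts above concrete, here is a minimal sketch of the layout as I understand it from that description. It is written in Python purely as arithmetic, it is not Perl's actual implementation, and the names encoded_length and encode are my own invention. It assumes the 1-to-6-byte forms use the usual UTF-8 style lead bytes, that the 7-byte form starts with 0xfe, and it ignores the platform integer limit entirely.

```python
# Sketch of the byte layout described above; NOT Perl's real code.
# Assumption: lead bytes follow the classic UTF-8 pattern, with 0xFE
# introducing the 7-byte form and 0xFF the 13-byte form.

# total length -> (lead-byte marker, data bits carried in the lead byte)
_LEAD = {
    1: (0x00, 7),
    2: (0xC0, 5),
    3: (0xE0, 4),
    4: (0xF0, 3),
    5: (0xF8, 2),
    6: (0xFC, 1),
    7: (0xFE, 0),
    13: (0xFF, 0),
}


def encoded_length(value):
    """Return how many bytes the value would occupy in this layout."""
    if value < 0:
        raise ValueError("value must be non-negative")
    for length in sorted(_LEAD):
        lead_bits = _LEAD[length][1]
        # Every byte after the first carries 6 data bits.
        capacity = lead_bits + 6 * (length - 1)
        if value < (1 << capacity):
            return length
    raise ValueError("value needs more than 72 bits")


def encode(value):
    """Encode a non-negative integer into the variable-length form."""
    length = encoded_length(value)
    marker = _LEAD[length][0]
    tail = []
    for _ in range(length - 1):
        # Continuation byte: high bit on, next bit off, 6 data bits.
        tail.append(0x80 | (value & 0x3F))
        value >>= 6
    # Whatever bits remain fit in the lead byte alongside its marker.
    return bytes([marker | value] + tail[::-1])
```

With this sketch, encoded_length(2**36 - 1) is 7 and encoded_length(2**36) is 13, which is the jump described above. The capacity arithmetic also shows why 12 bytes would have reached 2^66-1: one lead byte plus 11 following bytes at 6 bits each is 66 bits.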
On approximately 4/1/2007 4:05 PM, came the following characters from the keyboard of Glenn Linderman:
> ASCII semantics are used for:
>
>   \L, \l, \U, \u, \Q operations in bytes string constants (no character code values > 255)
    \L, \l, \U, \u, \Q operations in bytes string variables
    uc, lc operations when passed bytes string parameters (no character code values > 255)
>   regexp suboperations when used with byte string parameter:
>      case-insensitivity
>      character classes \w \W \s \S \b \B
>      modifiers /i
>
> Unicode semantics are used for:
>
>   \L, \l, \U, \u, \Q operations in multi-bytes string constants (at least one character code value > 255)
    \L, \l, \U, \u, \Q operations in multi-bytes string variables (no restrictions on character code values)
    uc, lc operations when passed multi-bytes string parameters (no restrictions on character code values)
>   regexp suboperations when used with multi-bytes string parameter:
>      case-insensitivity
>      match codes \w \W \s \S \b \B \Z
>      modifiers /i /m
>
> encode.pm and file handles with encoding layers explicitly define the
> character semantics they use, which include ASCII, Unicode, and many
> other encodings.
>
> Note: it appears to me that Perl (except for encode.pm) _never_
> applies Latin-1 semantics to anything, at present. But people talk
> about it, because if bytes strings are converted to multi-bytes
> strings the result is the same as converting Latin-1 character codes
> to Unicode character codes.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking