Maybe I'm just missing the point here, but having functions that expose the internals seems to me the completely wrong way to handle the "how many octets is this UTF-8 encoded string" question.

The original string in the perl model ought to be just a sequence of integers. What we need is a function (it could be a subfunction of unpack or whatever) that takes this sequence, encodes it in UTF-8, and returns a different sequence of integers: the octets of the UTF-8 encoding.

So to get the length in octets of a Unicode string you would just do:

    $string = "any string even containing high codepoints";
    $encoded = toutf8($string);
    print length $encoded;

For everything else, $encoded would be just a normal perl string, which could in fact be stored internally in any of the different ways. Only the user would know that this new sequence of integers is to be understood as a sequence of octets.

For Camel compatibility you could have a number of functions in "use bytes", where bytes::length is just an alias for length(toutf8(@_)).

In short, UTF-8 is just an encoding of the original sequence of integers, but you should get hold of it by asking perl to encode the sequence of integers for you, NOT by assuming that UTF-8 is the internal form and then exposing that. (The difference, of course, is that what I write above still works perfectly well even if the perl internal form were UCS-4 or whatever.)
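
As a rough illustration of the idea above, here is a minimal sketch that assumes the Encode module (bundled with perl 5.8 and later) is available. The names toutf8() and utf8_octet_length() are hypothetical stand-ins for the proposed function, not existing perl builtins.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-in for the proposed toutf8(): ask perl to encode the
# string's code points as UTF-8 and return the octets as a new string.
sub toutf8 {
    my ($string) = @_;
    return encode('UTF-8', $string);
}

# Hypothetical helper: length in octets of the UTF-8 encoding of a string.
sub utf8_octet_length {
    my ($string) = @_;
    return length toutf8($string);
}

# Six code points: c, a, f, U+00E9, space, U+263A.
my $string  = "caf\x{e9} \x{263A}";
my $encoded = toutf8($string);

print length($string), "\n";    # 6: number of code points
print length($encoded), "\n";   # 9: number of octets in the UTF-8 encoding
```

Nothing here inspects how perl stores $string internally. The octet count comes from an explicit encoding step, so the result would be the same whatever the internal representation happens to be.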