perl.perl6.internals.unicode | Postings from February 2001

Re: string encoding

Hong Zhang
February 16, 2001 16:32
> > I have already given the counter argument. The codepoint position is
> > in many cases. They should be deprecated.
> Uh? That doesn't make sense. Codepoint position is *exactly* what people
> expect when they use substr. When I say
>     $a = substr($b,10);
> I want the 10th character. If I get the 10th byte, and we're using UTF-8
> as you suggest, I might be cutting into the middle of a character, leaving
> the resulting string malformed. That's horrific.

I think you have mixed up codepoint vs. character. What you will get is the
10th codepoint, not the 10th character. If the 9th codepoint is a base
character and the 10th is a combining codepoint or a Hangul medial, you will
get a semantically malformed string. That's horrific too.
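
A minimal Python sketch of this point, not taken from the thread. The helper
name split_is_inside_cluster and the sample word are illustrative assumptions,
and the check only covers combining marks with a non-zero canonical combining
class (it does not detect Hangul jamo sequences, which need full grapheme
cluster rules).

```python
import unicodedata


def split_is_inside_cluster(text, index):
    """Return True if cutting `text` at `index` would separate a
    combining mark from the codepoint it modifies.

    Hypothetical helper: it only looks at the canonical combining
    class, so it is a rough approximation of grapheme boundaries.
    """
    return 0 < index < len(text) and unicodedata.combining(text[index]) != 0


# 'resume' with both accents stored as separate combining codepoints.
word = "re\u0301sume\u0301"

# Slicing by codepoint index keeps the base letter and drops its accent.
head = word[:2]

print(len(word), repr(head), split_is_inside_cluster(word, 2))
```

Indexing by codepoint never produces invalid encoded bytes, but as the slice
above shows, it can still cut a user-visible character in half.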

> No problem vs. problem.
> I know which I'd choose.

UTF-32 has its problems too, such as cache locality, memory footprint, and
encoding conversion. Many text files are in either ASCII or UTF-8; they
need little conversion other than validation. I have never seen, or
even heard of, anyone using UTF-32 text files.
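
To make the footprint argument concrete, here is a small Python sketch (an
illustration, not something from the thread). The function names
encoded_sizes and is_valid_utf8 are assumptions; the "-le" codec variants are
used so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return the encoded size in bytes of `text` for each encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


def is_valid_utf8(raw):
    """Validation-only pass: True if `raw` decodes as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


sizes = encoded_sizes("plain ASCII text")
print(sizes)
```

For pure ASCII input the UTF-32 form is four times the size of the UTF-8
form, while the UTF-8 bytes are identical to the ASCII file and only need to
be validated.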

> > My understanding is that chop() can be very efficient in the common
> > cases, for both UTF-8 and UTF-32.
> What about in the case of "abc\x{1F1E}"? UTF-32 or UTF-16 here is *vastly*
> more efficient than UTF-8.

I said it is not the common case, and it is not what chop() is supposed to do.
Don't forget you have already paid the memory overhead up front. You
basically pay the bill for a lunch you rarely eat.
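
A hedged sketch of why removing the last codepoint from UTF-8 is cheap: scan
backwards over at most three continuation bytes. The function chop_utf8 is a
hypothetical name, assumes its input is already valid UTF-8, and is not
Perl's actual chop() implementation.

```python
def chop_utf8(buf):
    """Drop the last codepoint from valid UTF-8 bytes.

    Continuation bytes have the bit pattern 10xxxxxx, so stepping
    backwards until a byte without that pattern finds the lead byte
    of the final codepoint.
    """
    if not buf:
        return buf
    end = len(buf) - 1
    while end > 0 and (buf[end] & 0xC0) == 0x80:
        end -= 1
    return buf[:end]


# The example string from the quoted message: "abc" plus U+1F1E,
# which takes three bytes in UTF-8.
sample = "abc\u1f1e".encode("utf-8")
print(chop_utf8(sample))
```

The backward scan touches a bounded number of bytes regardless of string
length, so the cost does not grow with the size of the string.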

> > The s/.// case is misleading too. If you define . as [^\n], UTF-8 and
> > UTF-32 will have exactly the same performance.
> No, no, no, no, no, no, no.
> UTF-16 case: Remove first two bytes
> UTF-8  case: Examine first byte, determine character width, remove n
> Now do that n times, and tell me which is more efficient.

You need to examine the first two bytes for UTF-16 too, right?
One table lookup can determine the length of the first codepoint.
Removing n bytes costs the same as removing one byte.
Remember UTF-32 is not free.
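
The table lookup could look roughly like the following Python sketch. The
names UTF8_LEN, drop_first_codepoint and utf16le_first_len are illustrative
assumptions, and the code only inspects the lead byte or first code unit; it
does not fully validate the rest of the sequence.

```python
def _lead_len(byte):
    """Length of the UTF-8 sequence started by `byte`, or 0 if it cannot start one."""
    if byte < 0x80:
        return 1
    if 0xC2 <= byte <= 0xDF:
        return 2
    if 0xE0 <= byte <= 0xEF:
        return 3
    if 0xF0 <= byte <= 0xF4:
        return 4
    return 0  # continuation byte or invalid lead byte


# Built once; afterwards finding the width is a single indexed load.
UTF8_LEN = bytes(_lead_len(b) for b in range(256))


def drop_first_codepoint(buf):
    """Remove the first codepoint from UTF-8 bytes using the table."""
    if not buf:
        return buf
    width = UTF8_LEN[buf[0]]
    if width == 0:
        raise ValueError("buffer does not start with a UTF-8 lead byte")
    return buf[width:]


def utf16le_first_len(buf):
    """Byte length of the first codepoint in UTF-16LE: 4 if the first
    code unit is a high surrogate, otherwise 2."""
    unit = buf[0] | (buf[1] << 8)
    return 4 if 0xD800 <= unit <= 0xDBFF else 2


print(drop_first_codepoint("\u00e9a".encode("utf-8")))
```

Both encodings need to look at the start of the buffer before knowing how
much to remove; UTF-16 is only fixed-width if surrogate pairs are ignored.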

> > Another example is m/S/i. The Unicode case mapping is one-to-many and
> > many-to-one, especially considering locale. Neither UTF-8 nor UTF-32
> > will save you.
> That's irrelevant. The efficiency-significant part is skipping through the
> string, and knowing *exactly* how far you need to skip ahead is much more
> efficient than having to stop and recalculate it for each character.
> I really cannot understand how to express this any simpler or any more
> persuasively.
> UTF16 : s += 2;            : O(1) : Good
> UTF8  : s += UTF8WIDTH(*s) : O(n) : Bad

What I don't understand is where you really use random access into a string.
I just don't see it. Skipping through the string is mostly done by the
Boyer-Moore algorithm. I have written both UTF-8 and UTF-16 versions of
it. The UTF-16 version will be in JDK 1.4. I did not see any difference in
the skipping.
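
As an illustration only (this is not the JDK code mentioned above), here is a
Python sketch of Boyer-Moore-Horspool, a simplified variant of Boyer-Moore,
running directly on encoded bytes. The function name horspool_find is an
assumption. The skip distance comes from a shift table keyed on byte values,
so the search never has to compute character widths.

```python
def horspool_find(haystack, needle):
    """Return the byte offset of the first match of `needle` in
    `haystack`, or -1 if there is none (Boyer-Moore-Horspool)."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    # Bad-character table: distance from the last occurrence of each
    # byte (excluding the final position) to the end of the needle.
    shift = {b: m - 1 - i for i, b in enumerate(needle[:-1])}
    pos = 0
    while pos + m <= n:
        if haystack[pos:pos + m] == needle:
            return pos
        # Skip ahead based on the byte aligned with the needle's end.
        pos += shift.get(haystack[pos + m - 1], m)
    return -1


text = "na\u00efve caf\u00e9".encode("utf-8")
pattern = "caf\u00e9".encode("utf-8")
print(horspool_find(text, pattern))
```

Because UTF-8 lead bytes and continuation bytes use disjoint ranges, a
byte-level match of a valid UTF-8 pattern in valid UTF-8 text always starts
on a codepoint boundary.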

