bytes and codepoints (was Re: Future Perl development)

From: Gerard Goossen
Date: February 7, 2007 15:56
Subject: bytes and codepoints (was Re: Future Perl development)
Message ID: 20070207235948.GA32672@ostwald

On Wed, Feb 07, 2007 at 10:44:55AM +0100, demerphq wrote:
> On 2/7/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> >
> >
> >However, all that encode/decode overhead would kill the performance
> >of these libraries, rendering them far less useful.  It would be nice
> >if Perl's internal encoding was always, officially UTF-8 -- then
> >there wouldn't be a conflict.  But I imagine that might be very hard
> >to pull off on EBCDIC systems, so maybe it's better this way -- I get
> >to choose not to support EBCDIC systems (along with systems that
> >don't use IEEE 754 floats, and systems where chars are bigger than a
> >byte).
> 
> I for one would argue that if we were going to go to a single internal
> encoding that utf8 would be the wrong one. Utf-16 would be much
> better. It would allow us to take advantage of the large amount of
> utf-16 code out there, ranging from DFA regexp engines to other
> algorithms and libraries. On Win32 the OS natively does utf-16 so much
> of the work would be done by the OS. I'd bet that this was also a
> reason why other languages choose to use utf-16. In fact I wouldn't be
> surprised if we were the primary language using utf8 internally at
> all.

The default encoding of gcc is UTF-8. Sure, it doesn't do anything with
multi-byte codepoints and only deals with bytes, but if you have, for
example, the C code C<*++p = '{'>, you are appending the character '{'
encoded as UTF-8 to the string.
On an EBCDIC platform, C<*++p = '{'> appends the character '{' encoded
using UTF-EBCDIC (or EBCDIC, since they give the same bytes). This is the
primary reason I prefer UTF-8 as the default encoding on ASCII platforms and
UTF-EBCDIC on EBCDIC platforms. If on some platform the C compiler generated
a 16-bit character for '{', I would recommend using UTF-16 as the default
encoding there.

 
> I mean heck, utf8 was a kludge worked out on a napkin to make it
> possible to store unicode filenames in a unix style filesystem. (utf8
> has the property that no encoding of a high codepoint contains any
> special character used by a unix filesystem) WTF would we use a kludge
> as our primary internal representation when there are better
> representations to use? Especially when you consider the performance
> impact of doing so (use unicode and watch the regex engine get much
> sloooooweeeeeerrrrrrr.)
>
> IMO UTF8 internally makes sense only when you consider that most of
> the time stuff is happening using latin_1 or whatever you want to call
> the single byte encoding we use.

Yes, and I would consider that a very good reason to use UTF-8.


> >>> I don't care whether $string is a text-string or byte-string, I
> >>> just want
> >>> it to returns the same string.
> >>
> >> Perhaps you should care. In a language such as Java, you are forced to
> >> care, as byte[] and String are different types. Perl blurs this
> >> difference,
> >> and lets you believe that you should not need to care.
> >
> >I agree, Mark.  Silent upgrading of bytes to Unicode strings cost me
> >a bunch of debugging time when I learned the hard way that you need
> >to care.  I was writing a serializer that concatenated Unicode
> >strings together with packed integers to make sort keys.  It never
> >occurred to me that such a concat operation would corrupt the packed
> >integer, and it took me a long time to hunt down why my sort op was
> >failing.
 
I think you misinterpreted me.
I am against silent upgrading.  ALWAYS leave the bytes alone.

If I have just read a few bytes and then try to do text things
with them, such as asking what the first character is,
I am still against any kind of upgrading.

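
To make the quoted problem concrete, here is a minimal sketch of how a
packed integer can end up with different bytes after being concatenated
with a text string. It assumes the combined key is later written out as
UTF-8 via Encode; the variable names are made up for illustration and this
is not the serializer described above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A packed 32-bit big-endian integer: the four bytes 00 00 00 C8.
my $packed = pack 'N', 200;

# A text string holding one codepoint above 0xFF (U+263A).
my $text = "\x{263A}";

# Concatenation treats every byte of $packed as a codepoint in 0..255.
my $key = $text . $packed;

# Assumption: the key is written out as UTF-8. Codepoint 0xC8 then
# becomes the two bytes C3 88, so the integer no longer round-trips.
my $octets = encode('UTF-8', $key);

print "packed bytes: ", unpack('H*', $packed), "\n";  # 000000c8
print "output bytes: ", unpack('H*', $octets), "\n";  # e298ba000000c388
```

The last four bytes of the packed integer have become five bytes in the
output, which is the kind of change that breaks a byte-wise sort key.
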
> I think this is the real itch. Before utf8 it was fine to think of
> strings as "byte buffers", but they aren't byte buffers and never have
> been, they are strings, and strings don't contain bytes (whatever
> Gerard thinks :-), they contain characters. \x{} doesn't _ever_
> produce a specific byte, it produces a specific _character_ (that just
> happens to be represented by a byte in one of the internal encodings we
> use).

I would say there are bytes, not codepoints. If you look into your computer
you won't find any codepoints there, you'll only find bytes (OK, if you
really look into your computer you will find real text like 'AMD' and 'nVidia').
The bytes in your computer can represent codepoints, but the things that are
actually there are bytes. I think this is important because bytes don't suddenly
change: you can write them to a hard disk and read them back, and they are
still the same bytes. But the codepoints these bytes represent can change.
I had a nice ASCII string; now I think the same bytes are UTF-16 and I
suddenly have a totally different string.
C libraries, networks, disks: they all deal with bytes, and normally
don't just change them. The idea of what these bytes represent might
get lost, and you might later regain it, but if the bytes themselves
are lost you don't have anything.
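
As a small illustration of the same bytes standing for different
codepoints, the sketch below decodes one four-byte buffer twice with the
Encode module. The sample bytes and variable names are only an example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Four bytes (50 65 72 6C) that spell "Perl" when read as ASCII.
my $bytes = "Perl";

# Decode copies so the original buffer is never touched.
my $as_ascii = decode('ascii',    my $copy1 = $bytes);
my $as_utf16 = decode('UTF-16BE', my $copy2 = $bytes);

printf "ascii:    %d characters\n", length $as_ascii;
printf "UTF-16BE: %d characters: %s\n", length $as_utf16,
    join ' ', map { sprintf 'U+%04X', ord($_) } split //, $as_utf16;
# ascii:    4 characters
# UTF-16BE: 2 characters: U+5065 U+726C
```

The buffer itself is unchanged in both cases; only the interpretation
differs.
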
 

> The correct way to get a specific byte is via pack, not via any
> string escape, as string escapes operate on characters. The fact that a
> given perl may internally encode strings as bytes is irrelevant.


Having a string escape to create a certain byte is very useful. For
example, from Pod/Simple/BlackBox.pm line 87:

     if( ($line = $source_line) =~ s/^\xEF\xBB\xBF//s ) {

This line looks for (and removes) the UTF-8 BOM, so it isn't
looking for characters, it is looking for bytes. I think we agree that the
above is not correct. I would like to use:

     if( ($line = $source_line) =~ s/^\x[EF]\x[BB]\x[BF]//s ) {

or

     if( ($line = $source_line) =~ s/^\x[EFBBBF]//s ) {

instead of having to create a byte string to be used in the
regex. I agree that using pack is correct, but I think the above is also
correct and much more convenient.
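
For comparison, a minimal sketch of the pack-based variant, which builds
the byte string first and then interpolates it into the regex. The
C<\x[...]> form is the syntax proposed in this message, so it is not used
here. The input line is hypothetical, standing in for raw bytes read from a
file in binary mode.

```perl
use strict;
use warnings;

# The UTF-8 BOM as an explicit three-byte string built with pack.
my $bom = pack 'C3', 0xEF, 0xBB, 0xBF;

# Hypothetical input: a line of raw bytes that starts with the BOM.
my $source_line = $bom . "=head1 NAME\n";

my $line;
if ( ($line = $source_line) =~ s/^\Q$bom\E//s ) {
    print "BOM removed, ", length($line), " bytes left\n";  # 12 bytes left
}
```

Whether this behaves any differently from the C<\xEF\xBB\xBF> form depends
on how the perl in question treats string escapes, which is the point under
discussion.
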


> So, imo what we need is an SV flag that says "this is a byte buffer,
> it does not contain characters and anytime it is concatenated with a
> string the string should also be treated as a byte buffer regardless
> of its actual state." Thus concatenating a byte-buffer with a utf8
> string would not upgrade the bytebuffer.

I agree, except that I would prefer an SV flag that says "this is a (valid) text
buffer", because I think it is easier to lose a flag than to gain one
(the flag is something extra: you always have bytes, and sometimes they represent
codepoints).
And I would rather have a text buffer lose its "text buffer" flag than a byte
buffer lose a "byte buffer" flag. Of course, in a perfect world it
should not matter.



Gerard Goossen



