perl.perl5.porters | Postings from February 2007

Re: bytes and codepoints (was Re: Future Perl development)

February 7, 2007 16:29
On Thu, Feb 08, 2007 at 12:59:48AM +0100, Gerard Goossen wrote:
> > I for one would argue that if we were going to go to a single internal
> > encoding that utf8 would be the wrong one. Utf-16 would be much
> > better. It would allow us to take advantage of the large amount of
> > utf-16 code out there, ranging from DFA regexp engines to other
> > algorithms and libraries. On Win32 the OS natively does utf-16 so much
> > of the work would be done by the OS. Id bet that this was also a
> > reason why other languages choose to use utf-16. In fact i wouldnt be
> > surprised if we were the primary language using utf8 internally at
> > all.
> The default encoding of gcc is UTF-8, sure it doesn't do anything with
> the multi-byte codepoints, and only deals with bytes. But if you have
> for example the C code C<*++p = '{'>, you are appending the character '{'
> encoded as UTF-8 to the string.

This is false. GCC's 'char' is in the native encoding, not UTF-8. Try using
an accented character from an 8-bit value: does GCC generate one character,
or two?

> On an EBCDIC platform C<*++p = '{'> appends the character '{' encoded
> using UTF-EBCDIC (or EBCDIC since they give the same bytes).

Also false. On an EBCDIC platform, C strings are EBCDIC encoded.

> This is the
> primary reason I prefer UTF-8 as default encoding on ASCII platforms, and
> UTF-EBCDIC on EBCDIC platforms.

As a primary reason, it is faulty.

You really need to separate storage representation from input format.
The "{" can be input as an EBCDIC "{", translated to UNICODE, and then
stored in memory using UTF-16. There is *no* requirement that the format
of the source file matches the format that Perl uses internally to
store the data. You should not need to care.

> If on some platform the C compiler generates
> a 16-bit character for '{' I would recommend using UTF-16 as default
> encoding.

For Perl? Or for the platform? Why would you make a decision based upon
a native implementation detail? Do you not value portability? UNICODE
is portable. EBCDIC is *not*. That ASCII and UNICODE overlap is not
a reason for EBCDIC and UNICODE to be treated as if they overlap.
UTF-EBCDIC is an abomination as far as I am concerned. I think IBM
should kill EBCDIC once and for all.

> I am against silent upgrading.  ALWAYS leave the bytes alone.

The only reason you care whether the bytes are left alone or not, is
that you wish to make assumptions about how the data is generated and
used. If you put this behind you - it would stop mattering.

> If I just read a few bytes, and I try to do text
> things on it like asking what the first character is
> I am against any type of upgrading.

I'm not sure why you care. Your reasons should relate to practical issues
such as performance or concurrency. As a matter of principle, you should
not care: it is an internal implementation detail. If Perl notices that
your 200 Kbyte string is entirely composed of "\0", you should not care
at all if Perl swaps in an internal representation that takes the form
of ('\0' x 200K) (one character and one integer instead of 200K bytes).

> > I think this is the real itch. Before utf8 it was fine to think of
> > strings as "byte buffers", but they arent byte buffers and never have
> > been, they are strings, and strings dont contain bytes (whatever
> > Gerrard thinks :-), they contain characters. \x{} doesnt _ever_
> > produce a specifc byte, it produces a specific _character_ (that just
> > happens to represented by a byte in one of the internal encodings we
> > use).
> I would say there are bytes, not codepoints. If you look into your computer
> you won't find any codepoints there, you'll only find bytes (okay, if you
> really look into your computer you will find real text like 'AMD' and 'nVidia').

If you look into your computer you won't find word processors or
music players either. This doesn't make them less real.

> The bytes in your computer can represent codepoints, but the things
> there are bytes. I think this is important because bytes don't suddenly
> change: you can write them to a hard disk, read them back, and it doesn't
> matter, they are still the same bytes. But the codepoints these bytes
> represent can change. I had a nice ASCII string, now I think the same
> bytes are UTF-16 and I suddenly have a totally different string.
> C-libraries, networks, disks, they all deal with bytes, and normally
> don't just change them. But the idea what these bytes represent might
> get lost, but you might regain the meaning of these bytes, but if the
> bytes are lost you don't have anything.

There is a reason people do not code in assembly language any more,
unless they need to. Abstractions simplify our world. There is no
reason why the modern average programmer should need to know the
difference between ASCII and EBCDIC. Languages such as Java conceal
this difference very well. Your use of UTF-EBCDIC does not. What will
happen if your UTF-EBCDIC Perl program tries to talk to a UTF-8 Perl
program running on Linux? Will they be speaking the same language?
Will they understand each other?

I believe your focus is wrong. It seems you want Perl to speak EBCDIC
natively, and do not see the reasons why UNICODE was designed.

Your EBCDIC/UNICODE hybrid is not UNICODE.


-- / /     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

                  Perl Programming lists via nntp and http.