
Re: Confounded [was: pack and ASCII]

From: Eric Brine
Date: January 16, 2012 15:42
Subject: Re: Confounded [was: pack and ASCII]
Message ID: CALJW-qELvdvOWLkrF9yADS7ByufYdMfqb2Ot1Dgh7-XAcdW-Lw@mail.gmail.com

On Mon, Jan 16, 2012 at 5:17 AM, Leon Timmermans <fawaka@gmail.com> wrote:
> In utf8.pm, downgrade is defined as «Converts in-place the internal
> representation of the string from UTF-X to the equivalent octet
> sequence in the native encoding (Latin-1 or EBCDIC)».

That uses very poor terminology.

The name of the original format is utf8 (no dash).

And the result has nothing to do with Latin-1, EBCDIC or anything "native".
"char*" is the best name I can come up with for it right now. People often
call this "byte string", but I find that confusing (since a string of bytes
is not always stored as a byte string).
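
To illustrate (a rough sketch; the variable names are made up for the example), switching the storage format back and forth with C<utf8::upgrade> and C<utf8::downgrade> leaves the string's value untouched. Only the UTF8 flag and the layout of the PV change:

```perl
use strict;
use warnings;

# One string element with value 0xE9.
my $s      = "\xE9";
my $before = $s;

utf8::upgrade($s);      # switch the PV to the format used when the UTF8 flag is set
my $flag_after_upgrade = utf8::is_utf8($s) ? 1 : 0;

utf8::downgrade($s);    # switch the PV back to the format used when the flag is clear
my $flag_after_downgrade = utf8::is_utf8($s) ? 1 : 0;

# The string itself never changed, only how it is stored.
my $same_value = ($s eq $before) ? 1 : 0;
my $ord_after  = ord($s);

print "flag after upgrade: $flag_after_upgrade, ",
      "flag after downgrade: $flag_after_downgrade, ",
      "same value: $same_value, ord: $ord_after\n";
```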


On Mon, Jan 16, 2012 at 11:11 AM, Aristotle Pagaltzis <pagaltzis@gmx.de>
wrote:
> Until then, threads like this will be exercises in confusion as people
> mean different things when they say the same words – in fact often mean
> the same different things, only at opposite times, making communication
> all but an accident: no one can either hear what the other truly is
> saying or likewise truly be heard in turn.

Definitions in search of terms:

Basics:

   - An element of a string, as in what C<< substr($s, $i, 1) >> returns.
   It's a 72-bit value in theory, but it's limited to the size of a UV in
   practice.

I use "character" like the documentation of string functions (C<substr>,
C<index>, C<reverse>, C<chr>, C<ord>, etc) and Wikipedia's definition of
"string".

I have also used "string element", and will do so in this post.
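
A small sketch of what "string element" means here. The sample string and variable names are only for illustration:

```perl
use strict;
use warnings;

# Three string elements: 0x61, 0xE9 and 0x263A.
my $s = "a\xE9\x{263A}";

my $count  = length($s);    # number of string elements
my @values = map { ord(substr($s, $_, 1)) } 0 .. $count - 1;

# Print the values in hex so the output itself stays plain ASCII.
printf "%d elements: %s\n", $count, join(' ', map { sprintf '0x%X', $_ } @values);
```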


String element semantics:

   - A string element whose value is understood/expected to be in [0, 255]
   (regardless of the value of the UTF8 flag).

I use "byte", but that has caused confusion.

   - A string element whose value is understood/expected to be a Unicode
   code point (regardless of the value of the UTF8 flag).

I use "code point" or "Unicode code point".


String semantics:

   - A string whose value is understood/expected to be a sequence of values
   in [0, 255] (regardless of the value of its UTF8 flag).

I use "string of bytes", but that's too similar to "byte string" which
usually means something else.

   - A string whose value is understood/expected to be a sequence of
   Unicode code points (regardless of the value of its UTF8 flag).

I use "text" or "decoded text".
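
A hedged sketch of the distinction, assuming the core Encode module and made-up variable names: the same data can be held as text or as a string of bytes, and changing the storage format of the string of bytes does not turn it into text.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Text: two string elements understood as code points U+00E9 and U+263A.
my $text = "\x{E9}\x{263A}";

# String of bytes: the UTF-8 encoding of that text, five elements in [0, 255].
my $bytes = encode('UTF-8', $text);

# Setting the UTF8 flag changes the storage, not the elements.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

my $text_len     = length($text);
my $bytes_len    = length($bytes);
my $upgraded_len = length($upgraded);
my $still_equal  = ($upgraded eq $bytes) ? 1 : 0;
my $round_trip   = (decode('UTF-8', $bytes) eq $text) ? 1 : 0;

print "text: $text_len, bytes: $bytes_len, upgraded: $upgraded_len, ",
      "equal: $still_equal, round trip: $round_trip\n";
```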


String storage formats:

   - The format of the PV in a string whose UTF8 flag is clear.

I use "UTF8=0 storage format". It's unambiguous, but it's quite a mouthful.
Others use "byte string", but those who do tend to use that for what I
called "strings of bytes" above too.

   - The format of the PV in a string whose UTF8 flag is set.

I use "UTF8=1 storage format". It's unambiguous, but it's quite a mouthful.
Others use "character string", but that's confusing because all strings are
made of characters by definition.
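
As a sketch of the two storage formats (variable names are hypothetical): a string whose elements all fit in [0, 255] can live in either format and still compare equal, while a string with an element above 255 can only be stored with the UTF8 flag set.

```perl
use strict;
use warnings;

my $flag_clear = "caf\xE9";    # all elements in [0, 255], stored with UTF8=0
my $flag_set   = $flag_clear;
utf8::upgrade($flag_set);      # same elements, now stored with UTF8=1

my $clear_is_utf8 = utf8::is_utf8($flag_clear) ? 1 : 0;
my $set_is_utf8   = utf8::is_utf8($flag_set)   ? 1 : 0;
my $equal         = ($flag_clear eq $flag_set) ? 1 : 0;

# An element above 255 cannot be stored with UTF8=0.
# The true second argument makes downgrade return false instead of dying.
my $wide       = "\x{263A}";
my $downgraded = utf8::downgrade($wide, 1) ? 1 : 0;

print "UTF8=0 flag: $clear_is_utf8, UTF8=1 flag: $set_is_utf8, ",
      "equal: $equal, wide downgraded: $downgraded\n";
```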

- Eric
