Front page | perl.perl5.porters |
Postings from January 2012
Re: Confounded [was: pack and ASCII]
From:
Eric Brine
Date:
January 16, 2012 15:42
Subject:
Re: Confounded [was: pack and ASCII]
Message ID:
CALJW-qELvdvOWLkrF9yADS7ByufYdMfqb2Ot1Dgh7-XAcdW-Lw@mail.gmail.com
On Mon, Jan 16, 2012 at 5:17 AM, Leon Timmermans <fawaka@gmail.com> wrote:
> In utf8.pm, downgrade is defined as «Converts in-place the internal
> representation of the string from UTF-X to the equivalent octet
> sequence in the native encoding (Latin-1 or EBCDIC)».
That uses very poor terminology.
The name of the original format is utf8 (no dash).
And the result has nothing to do with Latin-1, EBCDIC or anything "native".
"char*" is the best name I can come up with for it right now. People often call
this "byte string", but I find that confusing (since a string of bytes is
not always stored as a byte string).
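To make the point concrete, here is a minimal sketch (my own illustration, not taken from utf8.pm) of what C<utf8::upgrade> and C<utf8::downgrade> do and don't change. The variable names are made up for the example; the only calls used are the always-available C<utf8::> functions.

```perl
use strict;
use warnings;

# One string element with value 0xE9, created in the UTF8=0 format.
my $s = "\xE9";
utf8::upgrade($s);                        # switch the storage format
my $flag_up = utf8::is_utf8($s) ? 1 : 0;

my $copy = $s;
utf8::downgrade($copy);                   # switch it back on a copy
my $flag_down = utf8::is_utf8($copy) ? 1 : 0;

# The string elements are the same before and after.
my $same = ($s eq $copy) ? 1 : 0;

# An element above 255 cannot be stored in the UTF8=0 format, so the
# downgrade fails; the second argument (FAIL_OK) makes it return false
# instead of dying.
my $wide = chr(0x263A);
my $wide_ok = utf8::downgrade($wide, 1) ? 1 : 0;

printf "upgraded:   flag=%d length=%d ord=0x%X\n", $flag_up,   length($s),    ord($s);
printf "downgraded: flag=%d length=%d ord=0x%X\n", $flag_down, length($copy), ord($copy);
print  "equal: $same, wide downgrade ok: $wide_ok\n";
```

Both strings have length 1 and compare equal; only the flag, and the buffer behind it, differ.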
On Mon, Jan 16, 2012 at 11:11 AM, Aristotle Pagaltzis <pagaltzis@gmx.de>
wrote:
> Until then, threads like this will be exercises in confusion as people
> mean different things when they say the same words – in fact often mean
> the same different things, only at opposite times, making communication
> all but an accident: no one can either hear what the other truly is
> saying or likewise truly be heard in turn.
Definitions in search of terms:
Basics:
- An element of a string, as in what C<< substr($s, $i, 1) >> returns.
It's a 72-bit value in theory, but it's limited to the size of a UV in
practice.
I use "character" like the documentation of string functions (C<substr>,
C<index>, C<reverse>, C<chr>, C<ord>, etc.) and Wikipedia's definition of
"string".
I have also used "string element", and will do so in this post.
String element semantics:
- A string element whose value is understood/expected to be in [0, 255]
(regardless of the value of the UTF8 flag).
I use "byte", but that has caused confusion.
- A string element whose value is understood/expected to be a Unicode
code point (regardless of the value of the UTF8 flag).
I use "code point" or "Unicode code point".
String semantics:
- A string whose value is understood/expected to be a sequence of values
in [0, 255] (regardless of the value of its UTF8 flag).
I use "string of bytes", but that's too similar to "byte string" which
usually means something else.
- A string whose value is understood/expected to be a sequence of
Unicode code points (regardless of the value of its UTF8 flag).
I use "text" or "decoded text".
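A small sketch of the "regardless of the UTF8 flag" part, assuming nothing beyond the built-in C<utf8::> functions (the variable names are mine). It deliberately ends up with a string of bytes stored with UTF8=1 and decoded text stored with UTF8=0, to show the flag says nothing about which of the two a string is.

```perl
use strict;
use warnings;

# A string of bytes: the UTF-8 encoding of U+00E9 (two string elements).
my $bytes = "\xC3\xA9";

# Decoded text: one string element holding the code point U+00E9.
my $text = $bytes;
utf8::decode($text) or die "not valid UTF-8";

# Change the storage formats only; the string elements are untouched.
utf8::upgrade($bytes);      # string of bytes, now stored with UTF8=1
utf8::downgrade($text);     # decoded text, now stored with UTF8=0

my @byte_values = map { ord } split //, $bytes;
my $bytes_flag  = utf8::is_utf8($bytes) ? 1 : 0;
my $text_flag   = utf8::is_utf8($text)  ? 1 : 0;

printf "bytes: flag=%d values=%s\n",
    $bytes_flag, join(' ', map { sprintf '%02X', $_ } @byte_values);
printf "text:  flag=%d length=%d ord=U+%04X\n",
    $text_flag, length($text), ord($text);
# prints:
#   bytes: flag=1 values=C3 A9
#   text:  flag=0 length=1 ord=U+00E9
```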
String storage formats:
- The format of the PV in a string whose UTF8 flag is clear.
I use "UTF8=0 storage format". It's unambiguous, but it's quite a mouthful.
Others use "byte string", but those who do tend to use that for what I
called "strings of bytes" above too.
- The format of the PV in a string whose UTF8 flag is set.
I use "UTF8=1 storage format". It's unambiguous, but it's quite a mouthful.
Others use "character string", but that's confusing because all strings are
made of characters by definition.
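For what it's worth, the two storage formats can be observed from Perl without Devel::Peek. The sketch below is only an illustration; it assumes C<bytes::length> (loaded with C<use bytes ()>) as a way to see the size of the PV buffer, and the variable names are mine.

```perl
use strict;
use warnings;
use bytes ();    # loads bytes::length without turning on byte semantics

my $down = "\xE9";       # UTF8=0 storage format
my $up   = $down;
utf8::upgrade($up);      # UTF8=1 storage format, same string element

my $len_down = length($down);          # string elements
my $len_up   = length($up);
my $pv_down  = bytes::length($down);   # bytes in the PV buffer
my $pv_up    = bytes::length($up);

printf "UTF8=0: elements=%d pv_bytes=%d\n", $len_down, $pv_down;
printf "UTF8=1: elements=%d pv_bytes=%d\n", $len_up,   $pv_up;
# prints:
#   UTF8=0: elements=1 pv_bytes=1
#   UTF8=1: elements=1 pv_bytes=2
```

Same one-element string both times; only the PV differs.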
- Eric