
Re: Question about the inner workings of "storable".

From: Zefram
Date: February 9, 2017 20:14
Subject: Re: Question about the inner workings of "storable".
Message ID: 20170209201424.GF6573@fysh.org

Michel Albert wrote:
>                    Should strings in Perl (scalars?) be considered to be
>in the encoding of the source .pl file?

No.  A Perl string represents a defined sequence of Unicode codepoints;
where any decoding is required, normally that has already happened.
However, Perl does not distinguish between characters and octets.
Octets are aliased to Latin-1 characters.  So a Perl string containing
only codepoints below 0x100 might be intended to serve as an octet string.
In that case, the string represents a defined sequence of octets, and
no more.  In either case, the string does not carry an encoding with it.
An octet string cannot be decoded without other information about the
scheme by which it is encoded, and of course it might represent something
other than a character string.
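
A minimal sketch of the octet/character aliasing, assuming the core
Encode module; the variable names are only illustrative.  It shows that
an encoded octet string is indistinguishable from a Latin-1 character
string, and that the decoding scheme has to be supplied from outside:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A character string: a single codepoint above 0xFF.
my $chars = "\x{263A}";

# Encoding produces an octet string; every element is below 0x100.
my $octets = encode('UTF-8', $chars);

# To Perl, those octets are equally a three-character Latin-1 string.
my $as_latin1 = "\x{E2}\x{98}\x{BA}";

# Nothing in $octets says "UTF-8"; the caller has to name the scheme.
my $decoded = decode('UTF-8', $octets);

printf "chars=%d octets=%d aliased=%s round_trip=%s\n",
    length($chars), length($octets),
    ($octets eq $as_latin1 ? "yes" : "no"),
    ($decoded eq $chars ? "yes" : "no");
```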

Where a string encoding is used in a Storable file, the encoding (either
Latin-1 or UTF-8) is independent of any encoding used by programs that
generate or consume the string.  The internal encoding in the Storable
file carries no semantic information at all.  In particular, it does not
allow one to distinguish between octet strings and character strings.
There is no way in general to make that distinction; it is not a property
of the string data but of the way in which the string is used.
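
As a rough illustration (a sketch, not a description of Storable's
internals), the following takes one string in both internal
representations and round-trips each through Storable's freeze/thaw.
The variable names are made up for the example:

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# The same four characters, held in two internal representations.
my $down = "caf\x{E9}";
utf8::downgrade($down);      # one byte per character internally
my $up = $down;
utf8::upgrade($up);          # UTF-8 internally

# The internal flag differs, but the strings are the same string.
my $flags_differ = utf8::is_utf8($up) && !utf8::is_utf8($down);
my $same_before  = ($down eq $up);

# Round-trip each through Storable; the values still compare equal.
my $rt_down = ${ thaw(freeze(\$down)) };
my $rt_up   = ${ thaw(freeze(\$up)) };
my $same_after = ($rt_down eq $rt_up) && ($rt_up eq $down);

print "flags differ: ", ($flags_differ ? "yes" : "no"), "\n";
print "equal before: ", ($same_before  ? "yes" : "no"), "\n";
print "equal after:  ", ($same_after   ? "yes" : "no"), "\n";
```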

The term "scalar" in Perl refers to more than just strings.  Strings are
a very important type of scalar.

>                      Should numerical keys in hashes be deserialized as
>numericals as well?

Perl doesn't really make that distinction.  Numbers, to Perl, are
effectively a subtype of string.  A numeric-looking hash key should be
treated the same as a numeric-looking value string that was represented
by string means.  It is impossible in either case to determine whether
the string was intended to be used as a number.  The same goes for a
value represented by a numeric encoding in a Storable file, whether
integer or floating point (not necessarily double, despite the Storable
nomenclature): for Perl purposes this is a valid way to represent
a string.
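
A small sketch of what that means in practice, assuming a hash with
made-up keys (num, str, float) round-tripped through freeze/thaw.  How
each value is encoded in the Storable image is not inspected here; the
point is only that the results behave alike as Perl values:

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# One value written as a number, one as a string, one as a float.
my $data = { num => 42, str => "42", float => 1.5 };
my $copy = thaw(freeze($data));

# After the round trip both compare equal as strings and as numbers.
my $eq_as_string = ($copy->{num} eq $copy->{str});
my $eq_as_number = ($copy->{num} == $copy->{str});

# The floating-point value is likewise usable as a string.
my $float_as_string = ($copy->{float} eq "1.5");

print "string equal: ", ($eq_as_string    ? "yes" : "no"), "\n";
print "number equal: ", ($eq_as_number    ? "yes" : "no"), "\n";
print "float string: ", ($float_as_string ? "yes" : "no"), "\n";
```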

>                        Is the following possible in Perl, and if yes, how
>are the two keys distinguished from each other?
>
>save_sample('mixed_hash',  {123 => 'a', '123' => 'b'});

The two hash keys are indistinguishable.  Perl does permit this
expression, because duplicate hash keys are generally permitted, the
last mention taking precedence.  Thus the hash has only one element.
The result will be the same as what you'd get from { 123 => "b" } or
{ "123" => "b" }.
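
This is easy to check directly.  A short sketch using the hash from the
question:

```perl
use strict;
use warnings;

# Both keys stringify to "123", so the second pair replaces the first.
my %h = (123 => 'a', '123' => 'b');

my $count = scalar(keys %h);
my ($only_key) = keys %h;

print "elements: $count\n";
print "key: $only_key, value: $h{$only_key}\n";
```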

>                      How does Perl know to store values as UTF-8?

As I indicated above in response to your first question, the internal
encoding has no semantic significance.  Any string containing a codepoint
of 0x100 or above can only be stored using the UTF-8 encoding, but a
string that does not can be stored either way.  Which one it actually
ends up as is a matter of accident: generally whatever was most
convenient to generate, given the processes by which the string was
constructed.

>my $a = "\x{263A}";
>save_sample('utf8test01', \$a);

In this case, the string cannot be represented in Latin-1 because it
contains a non-Latin-1 codepoint.
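
One way to see this from Perl itself (a sketch; it looks at Perl's own
internal representation rather than at the Storable file) is
utf8::downgrade with its second argument set, so that failure is
reported instead of being fatal:

```perl
use strict;
use warnings;

# U+263A is above 0xFF, so it has no one-byte-per-character form.
my $smiley = "\x{263A}";
my $smiley_ok = utf8::downgrade($smiley, 1);   # false: cannot downgrade

# U+00E9 fits in Latin-1, so either representation is possible.
my $latin = "\x{E9}";
utf8::upgrade($latin);                         # now UTF-8 internally
my $latin_ok = utf8::downgrade($latin, 1);     # true: back to one byte

print "smiley downgradable: ", ($smiley_ok ? "yes" : "no"), "\n";
print "e-acute downgradable: ", ($latin_ok ? "yes" : "no"), "\n";
```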

-zefram
