perl.perl5.porters | Postings from January 2020

Re: “strict” strings?

From:
Felipe Gasper
Date:
January 7, 2020 04:04
Subject:
Re: “strict” strings?
Message ID:
D3248089-E8C2-4DB6-A954-ACAC1785B15D@felipegasper.com


> On Jan 6, 2020, at 10:36 PM, Tony Cook <tony@develop-help.com> wrote:
> 
> On Mon, Jan 06, 2020 at 10:02:10PM -0500, Felipe Gasper wrote:
>> 
>>> On Jan 6, 2020, at 9:43 PM, Tony Cook <tony@develop-help.com> wrote:
>>> 
>>> If Sereal converts a SVf_UTF8 off SV-with-PV to a binary specific type in some
>>> other language, that is a bug in Sereal.  I haven't tried it.
>> 
>> FWIW:
>> 
>>> perl -MSereal::Encoder -e'my $a = "\xc2\xa9"; print encode_sereal($a)' | xxd
>> 00000000: 3df3 726c 0400 62c2 a9                   =.rl..b..
>> 
>> The antepenultimate 0x62 is SHORT_BINARY_2.
>> 
>>> perl -MSereal::Encoder -e'my $a = "\xc2\xa9"; utf8::decode($a); print encode_sereal($a)' | xxd
>> 00000000: 3df3 726c 0400 2702 c2a9                 =.rl..'...
>> 
>> The [0x27 0x02] sequence indicates STR_UTF8, length 2.
>> 
>> 
>> And in fact, the module’s POD specifically states that it keys off SVf_UTF8. So there’s that.
> 
> That doesn't answer my supposition.

> If I understand the code either way it ends up with a string object,
> not a bytes object ( decoded with struct.unpack with a format of
> "<length>s" which returns a string ).
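
On the struct.unpack point, a minimal sketch, assuming Python 3 (under Python 2 the "s" format yields a str, which is itself a byte string). It only exercises the standard struct module, not the Sereal decoder; the variable names are illustrative:

```python
import struct

payload = b"\xc2\xa9"  # the two payload bytes from the Sereal documents above

# A "<length>s" format with length 2, as described in the quoted text.
# Under Python 3 the unpacked value is a bytes object, not a text string.
(raw,) = struct.unpack("%ds" % len(payload), payload)
print(type(raw).__name__, raw)

# Text only appears once an encoding is applied explicitly.
text = raw.decode("utf-8")
print(type(text).__name__, ascii(text))
```

So whether the caller sees bytes or text depends on what the decoder does with the unpacked value afterwards.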

Python has a distinct byte-string type, though. With the new flag (bin_mode_classic=0) the decoder behaves thus:

>>> from sereal.decoder import SrlDecoder
>>> decoder = SrlDecoder( bin_mode_classic=0 )
>>> decoder.decode(b'=\xf3rl\x04\x00b\xc2\xa9')
b'\xc2\xa9'
>>> decoder.decode(b'=\xf3rl\x04\x00\x27\x02\xc2\xa9')
'©'

Note that decoding the Sereal byte/“encodingless” string yields a Python byte string (the “b” prefix), whereas decoding the Sereal Unicode string yields a text string. The only thing in Perl that caused the difference between the two Sereal documents is the UTF8 flag.
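
To make the tag distinction concrete, here is a rough sketch of how a reader could tell the two documents apart from the tag byte alone. It is not the sereal package's code: the function names are made up, and the header layout (4-byte magic, version/encoding byte, varint suffix length), the BINARY tag 0x26, and the 0x60..0x7f range for SHORT_BINARY are taken from my reading of the Sereal spec rather than from anything in this thread. It handles only an uncompressed document whose body is a single string.

```python
MAGIC = b"=\xf3rl"  # magic assumed for Sereal protocol versions 3 and later


def read_varint(buf, pos):
    """Decode a little-endian base-128 varint; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def classify_top_level_string(doc):
    """Return ("binary", bytes) or ("text", str) for a one-string document."""
    if doc[:4] != MAGIC:
        raise ValueError("not a Sereal v3+ document")
    if doc[4] >> 4:
        # High nibble of the version byte selects a compressed encoding.
        raise ValueError("compressed bodies are not handled in this sketch")
    suffix_len, pos = read_varint(doc, 5)  # optional header suffix
    pos += suffix_len
    tag = doc[pos] & 0x7F  # high bit is a tracking flag, not part of the tag
    pos += 1
    if 0x60 <= tag <= 0x7F:  # SHORT_BINARY_0 .. SHORT_BINARY_31
        length = tag & 0x1F  # length lives in the low five bits
        return "binary", doc[pos:pos + length]
    if tag == 0x26:  # BINARY, followed by a varint length
        length, pos = read_varint(doc, pos)
        return "binary", doc[pos:pos + length]
    if tag == 0x27:  # STR_UTF8, followed by a varint length
        length, pos = read_varint(doc, pos)
        return "text", doc[pos:pos + length].decode("utf-8")
    raise ValueError("unsupported tag 0x%02x" % tag)


binary_result = classify_top_level_string(b"=\xf3rl\x04\x00b\xc2\xa9")
text_result = classify_top_level_string(b"=\xf3rl\x04\x00\x27\x02\xc2\xa9")
print(binary_result[0], text_result[0])
```

The payload bytes are identical in both cases; only the tag differs, which is what a decoder in another language has to go on.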

> 
> Java appears to treat it as a String rather than a byte[]:
> 
> https://github.com/Sereal/Sereal/blob/3a6c62eca003d1a35dd37ea38c6f71084607046d/Java/sereal/src/main/java/com/booking/sereal/Decoder.java#L727

This decoder appears to be configurable as well:

https://github.com/Sereal/Sereal/blob/3a6c62eca003d1a35dd37ea38c6f71084607046d/Java/sereal/src/main/java/com/booking/sereal/Decoder.java#L471

Unlike in the Python implementation, the default in Java appears to be to return byte[]. I’ve not run it, though.

-FG