
Re: “strict” strings?

From: Tony Cook
Date: January 7, 2020 03:37
Subject: Re: “strict” strings?
Message ID: 20200107033656.GE5228@mars.tony.develop-help.com

On Mon, Jan 06, 2020 at 10:02:10PM -0500, Felipe Gasper wrote:
> 
> > On Jan 6, 2020, at 9:43 PM, Tony Cook <tony@develop-help.com> wrote:
> > 
> > If Sereal converts a SVf_UTF8 off SV-with-PV to a binary specific type in some
> > other language, that is a bug in Sereal.  I haven't tried it.
> 
> FWIW:
> 
> > perl -MSereal::Encoder -e'my $a = "\xc2\xa9"; print encode_sereal($a)' | xxd
> 00000000: 3df3 726c 0400 62c2 a9                   =.rl..b..
> 
> The antepenultimate 0x62 is SHORT_BINARY_2.
> 
> > perl -MSereal::Encoder -e'my $a = "\xc2\xa9"; utf8::decode($a); print encode_sereal($a)' | xxd
> 00000000: 3df3 726c 0400 2702 c2a9                 =.rl..'...
> 
> The [0x27 0x02] sequence indicates STR_UTF8, length 2.
> 
> 
> And in fact, the module’s POD specifically states that it keys off SVf_UTF8. So there’s that.
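
For readers following along, the two dumps quoted above can be picked apart by hand. The following is only a minimal sketch, not a Sereal decoder: it assumes the layout visible in those dumps (a 4-byte magic, one version byte, and a zero-length header suffix), assumes the short-binary tags span 0x60 to 0x7f, and reads the STR_UTF8 length as a single byte, which is enough for a 2-byte payload. The function name is made up for illustration.

```python
# Sketch only: reads the first body tag of the two documents quoted above.
# Assumed (not a full parser): 4-byte magic, 1 version byte, and an empty
# header suffix, which is what both dumps show.

SHORT_BINARY_BASE = 0x60  # 0x62 is SHORT_BINARY_2, per the quoted message
SHORT_BINARY_COUNT = 32   # assumption: tags 0x60..0x7f carry lengths 0..31
STR_UTF8 = 0x27           # followed by a length, per the quoted message


def describe_first_tag(doc: bytes) -> str:
    """Hypothetical helper: name the first tag and show its payload in hex."""
    body = doc[6:]  # skip magic (4), version (1), header-suffix length (1)
    tag = body[0]
    if SHORT_BINARY_BASE <= tag < SHORT_BINARY_BASE + SHORT_BINARY_COUNT:
        length = tag - SHORT_BINARY_BASE  # length is folded into the tag
        payload = body[1:1 + length]
        return f"SHORT_BINARY_{length} {payload.hex()}"
    if tag == STR_UTF8:
        length = body[1]  # single-byte length suffices for this example
        payload = body[2:2 + length]
        return f"STR_UTF8 len={length} {payload.hex()}"
    return f"unhandled tag 0x{tag:02x}"


binary_doc = bytes.fromhex("3df3726c040062c2a9")    # first dump
utf8_doc = bytes.fromhex("3df3726c04002702c2a9")    # second dump
print(describe_first_tag(binary_doc))
print(describe_first_tag(utf8_doc))
```

Both documents carry the same two payload bytes (c2 a9); only the tag in front of them differs, which is the distinction under discussion.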

That doesn't answer my supposition.

Do you get a bytes object or str object if you try to decode it in
Python 3.x?  This appears to have changed recently:

https://github.com/Sereal/Sereal/commit/df33d1c458d3baeb3c34ef319f30b940688a1964
(it looks like the old and still default behaviour is just plain
broken, decoding the binary string as UTF-8.)

If I understand the code, either way it ends up with a string object,
not a bytes object (decoded with struct.unpack with a format of
"<length>s", which returns a string).
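
As a quick check of the struct behaviour alone, here is a small sketch of what a "<length>s" format gives back. It does not exercise the Sereal Python decoder itself, and the variable names are illustrative. On Python 3 the "s" format yields a bytes object, whereas on Python 2 the same call yields a str, so what the caller finally sees depends on whether the decoder decodes the field afterwards.

```python
import struct

raw = b"\xc2\xa9"  # the two payload bytes from the dumps quoted earlier

# "<length>s" with length 2: read exactly two bytes as a single field.
(field,) = struct.unpack("2s", raw)

# On Python 3 this is a bytes object; text appears only after an explicit
# decode step, which is where a binary payload could be misread as UTF-8.
print(type(field).__name__)
text = field.decode("utf-8")
print(type(text).__name__, len(text))
```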

Java appears to treat it as a String rather than a byte[]:

https://github.com/Sereal/Sereal/blob/3a6c62eca003d1a35dd37ea38c6f71084607046d/Java/sereal/src/main/java/com/booking/sereal/Decoder.java#L727

Tony
