Re: Is it possible for Str to not be well formed?

From: Elizabeth Mattijsen
Date: September 17, 2019 08:51
Subject: Re: Is it possible for Str to not be well formed?
Message ID: C918D9ED-E23A-4D4F-9023-775674DCED76@dijkmat.nl

Short answer: to my knowledge, if you can make a string contain invalid codepoints, it is a bug and should be reported so that it can be fixed.

> On 15 Sep 2019, at 23:08, Darren Duncan <darren@DarrenDuncan.net> wrote:
> 
> I'm defining an API that takes only well formed Str objects, meaning it would only accept Str whose Unicode codepoints are all in the set {0..0xD7FF,0xE000..0x10FFFF} and in particular there are no UTF-16 surrogate characters, and it would do so as a yes/no stricture without coercing anything outside of the set.
> 
> I am aware of how behind the scenes Perl 6 uses multiple levels of abstraction for Str objects, and in particular may often use Normal Form G to utilize codepoints above 0x10FFFF to be able to represent graphemes in constant space.
> 
> I have a few questions:
> 
> 1. Do I even have to test the Str at all?  Does Perl 6 guarantee that all Str are well formed, such that for example if one tried to decode UTF-16 that contained invalid surrogate codepoints (single ones or ones not properly paired up) that this would fail early, or is it possible that a Str could be created without fuss that contains the invalid surrogates?  I suspect Perl 6's inherent laziness would make passing through invalid codepoints more likely, but perhaps that isn't the case.
> 
> 2. Does Perl 6 ever have Str that are not internally in some normal form?  That is, if a file contains say a mixture of NFC and NFD, the actual codepoints will be preserved at the start until some operation requires them to be in a normal form?  I'm thinking this may be a good case for laziness, eg you don't need normal forms to just move data around, but it can help if you want to count graphemes, so it only normalizes when such an operation happens.
> 
> 3. If a Str can contain invalid surrogates or be wrong in some other way, what is the best / most performant way to test that a Str is only valid?  Context is akin to a "Str where ..." and what we put in the "...".
> 
> 4. How can I get the actual codepoints from a Str without normalizing them first?  I realize for typical use cases, explicitly using the NFC/NFD etc methods, or "ords" which uses NFC, is the most correct, but if say I just want what we already have, how would I do that?  I realize the result may not be particularly useful in the face of NFG.
> 
> For a wider context, I know that in other programming languages like .NET or Java it is possible for their strings to have invalid surrogates, and I'm trying to figure out if Perl 6 can have the same problem or not.
> 
> Thank you.
> 
> -- Darren Duncan
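
For illustration only, here is a minimal sketch of the yes/no stricture described in the quoted message, i.e. accepting only codepoints in {0..0xD7FF,0xE000..0x10FFFF}. It is written in Python rather than Perl 6 because Python's str, like the .NET and Java strings mentioned above, can hold unpaired surrogates, so the check has something to reject. The function name is_well_formed is hypothetical and does not come from the thread, and nothing here says anything about how Perl 6 itself behaves.

```python
def is_well_formed(s: str) -> bool:
    """Return True if every code point in s is a Unicode scalar value.

    Hypothetical helper for illustration. A Python str cannot hold code
    points above 0x10FFFF, so the only way to fall outside the set
    {0..0xD7FF, 0xE000..0x10FFFF} is a surrogate (0xD800..0xDFFF).
    """
    return not any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


# A lone high surrogate (0xD800) followed by "A", encoded as UTF-16LE.
raw = b"\x00\xd8A\x00"

# The default strict decoder fails early on the unpaired surrogate.
try:
    raw.decode("utf-16-le")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# The "surrogatepass" error handler lets the lone surrogate through,
# producing a str that the predicate rejects.
leaky = raw.decode("utf-16-le", errors="surrogatepass")

print(strict_failed)             # True
print(is_well_formed("plain"))   # True
print(is_well_formed(leaky))     # False
```

The predicate is a pure test and coerces nothing, matching the "without coercing anything outside of the set" requirement; the decoding step only shows one way such a string can come into existence in a language that permits it.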
