On Thu, Aug 29, 2013 at 10:40 AM, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

> * Karl Williamson <public@khwilliamson.com> [2013-08-26T23:00:52]
> > A strict interpretation fails this because Unicode has never said that a
> > non-Unicode code point should be considered unassigned.  But I now
> > believe it is more DWIM to consider them so.
>
> Agreed.
>
> > Perl could change to make the fall-back value be what happens for
> > non-Unicode code points.  This, I believe, is more DWIM.
>
> Agreed.
>
> > The reason I didn't do this, besides wanting to be very strict
> > Unicode, is that there is a complication.  Consider the Perl
> > extension \p{Unassigned}, which is the same as \p{gc=Unassigned}.
> > Currently these match 864_348 code points.  If we changed the
> > decision I made, these would now match billions of code points.
> > Is that a problem?
>
> I guess the problem is that someone's program, previously, was doing this:
>
>   my $str = "Sentinel point follows: \x{FF_FFFF}";
>   if ($str =~ /\p{Unassigned}/) {...}
>
> ...and the branch will now be entered when it was not before?
>
> If that is the only problem, I'd like to hear from the folks who've talked
> about using trans-Unicode codepoints and whether they think this is going
> to cause actual problems.  My gut feeling is that we should feel free to
> change this unless the new semantics seem *wrong*, which they don't to me.
> After all, the docs mark this behavior as still in flux:
>

I've used trans-Unicode codepoints before*, and was bitten by them not
matching \p{Unassigned}; I understood the reasoning for having them not
match, but I wish that they did.

* I needed "a character that will never show up in database"