Re: RFC: Making -B and -T work better on 8-bit encodings

From: Ricardo Signes
Date: May 5, 2014 14:37
Subject: Re: RFC: Making -B and -T work better on 8-bit encodings
Message ID: 20140505143716.GB26305@cancer.codesimply.com

* Karl Williamson <public@khwilliamson.com> [2014-05-02T16:28:48]
> The -B and -T file test operators don't really work well for non-ASCII text
> files, except perhaps under 'use locale'.
> 
> I'm proposing to change them in several ways early in 5.21 to make sure that
> it doesn't adversely affect existing code before we decide for 5.22.

In general, these changes look like an improvement to me. I'd want to see
more opinions, if possible, as I don't use -T or -B in my usual work.

> The current UTF8 handling is haphazard, and I think suboptimal.  The patch
> changes things to see if the entire block is ASCII.  If not, it then looks
> to see if it is one long UTF-8 string.  The odds of something that passes
> that test for 512 bytes not being UTF-8 are vanishingly small.

Are we concerned about the 512-byte block ending in the middle of a multibyte
sequence?
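
To make the question concrete, here is a rough Python sketch of the check as
described in the quoted text: all-ASCII first, then strict UTF-8 over the whole
block. It is not the patch. The name classify_block and the decision to
tolerate an incomplete sequence at the very end of the block are my
assumptions, for illustration only.

```python
import codecs

def classify_block(block: bytes) -> str:
    """Return 'ascii', 'utf8', or 'other' for a block of bytes.

    Hypothetical helper: a sketch of the heuristic, not Perl's code.
    """
    # Step 1: is every byte in the block plain ASCII?
    if all(b < 0x80 for b in block):
        return "ascii"

    # Step 2: is the block one long well-formed UTF-8 string?
    # With final=False the incremental decoder buffers an incomplete
    # trailing sequence instead of raising, so a character cut in half
    # by the block boundary does not count against the block.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        decoder.decode(block, final=False)
    except UnicodeDecodeError:
        return "other"
    return "utf8"
```

A strict one-shot decode of the same block would reject a file whose first
512 bytes happen to end on a lead byte, which is the case I am asking about.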

> I also think that Vertical Tab and Form Feed are so infrequent that they
> should be counted as non-text (currently VT is non-text, but FF is text)

Seems reasonable to me.

> It represents a y with diaeresis in Latin1, which I believe occurs in
> modern French in only a couple of place names, and is not used in the other
> languages that Latin1 is designed for.

Also in the name of the once-excellent Seattle hard rock band Queensrÿche.
Let's not forget them!

> So I would change the patch to classify \xFF as a control.

Works for me, despite my love for the aforementioned band's early work. :)
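
For anyone following along, a minimal Python sketch of how I read the proposed
byte classification. Only three points come from this thread: VT, FF, and \xFF
count as non-text. Everything else here is my assumption, not the patch: the
name is_control_byte, the set of C0 controls treated as text, counting DEL as
a control, and treating all other high bytes as text.

```python
# Assumed for illustration: C0 controls that still count as ordinary text.
# VT (0x0B) and FF (0x0C) are deliberately absent, per the proposal.
TEXT_CONTROLS = frozenset(b"\t\n\r")

def is_control_byte(b: int) -> bool:
    """Return True if byte value b would count against 'text'.

    Hypothetical helper: a sketch, not Perl's implementation.
    """
    if b == 0xFF:
        # Proposed: Latin-1 y-with-diaeresis is classified as a control.
        return True
    if b < 0x20 or b == 0x7F:
        # C0 controls and DEL, minus the ones assumed to be text.
        return b not in TEXT_CONTROLS
    # Printable ASCII and the remaining high bytes are treated as text
    # in this sketch.
    return False
```

Under those assumptions, is_control_byte(0x0B) and is_control_byte(0x0C) are
both true, while the \xE9 in a Latin-1 "é" is not.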


-- 
rjbs
