perl.perl5.porters | Postings from July 2011

Re: Solving the *real* Dot Problem (was: Is 5.16 the time to remove \N, the complement of \n, from being experimental?)

From: Tom Christiansen
Date: July 7, 2011 09:59
Subject: Re: Solving the *real* Dot Problem (was: Is 5.16 the time to remove \N, the complement of \n, from being experimental?)
Message ID: 16481.1310057915@chthon

Abigail <abigail@abigail.be> wrote on Thu, 07 Jul 2011 11:51:58 +0200: 

> While Unicode is possible, almost all data I'm applying regexes to is
> ASCII data. I use /./ all the time, and for me, it just works. Where it
> doesn't, /(?s:.)/ does. /./ and /(?s:.)/ even work fine if I have mostly
> ASCII data with some Unicode characters or words thrown in.

> Full blown Unicode, which uses stuff where /./ or /(?s:.)/ won't work, 
> I've yet to have the need to parse it. 

Do you realize how "lucky" you are?  And, perhaps, how unusual?

The data we work with in biomedical text mining is asymptotically close
to being 100% Unicode data.  Think about the dozenish gigabytes of
the PubMed Open Access collection alone.  That's all in UTF-8 XML.
When we convert it to "plain text" for mining, we *must* handle
the &#x3B1; stuff correctly.  Wrong is not an option.
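
As an illustration of what handling "the &#x3B1; stuff" involves, here is a
minimal Perl sketch that turns numeric character references into the
characters they name. The helper name decode_numeric_refs and the sample
string are assumptions made for this example; they are not from the
conversion pipeline described above. In practice a real XML parser does this
decoding for you.

```perl
use strict;
use warnings;

# Hypothetical helper: decode numeric character references such as
# &#x3B1; (hexadecimal) or &#945; (decimal) in a single pass, so that
# text produced by one substitution is never decoded a second time.
sub decode_numeric_refs {
    my ($text) = @_;
    $text =~ s/&#(?:x([0-9A-Fa-f]+)|([0-9]+));/chr(defined $1 ? hex($1) : $2)/ge;
    return $text;
}

# Sample input is an assumption: one hex and one decimal reference.
my $plain = decode_numeric_refs("TGF-&#x3B1; and TGF-&#946;");

binmode(STDOUT, ":encoding(UTF-8)");   # emit the result as UTF-8
print "$plain\n";
```

Once decoded, the string holds real Unicode characters (here U+03B1 and
U+03B2), so any regex applied afterwards has to cope with more than ASCII.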

>> Here are 5 possible meanings for dot.  I start with the original and *LEAST
>> USEFUL OF ALL POSSIBLE MEANINGS*, and progress to the most useful ones, the
>> ones that I think people should usually be using these days:
>>
>>     1 = no  re /s       (traditional and annoying)
>>     2 = use re /s       (necessary but insufficient)
>>     3 = \V              (improved #1)
>>     4 = \X              (improved #2)
>>     5 = \X unless \R    (improved #2, #3)
>> 
>> See?  How often do you guys write the *wrong* one of those?  

> Never.

> Abigail

Abigail, you are not just "one of the guys".  You are one of the
only people who understand all these differences.  I would be
sad if you had written the wrong one.

But I still bet most people do.
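
To make the five meanings of dot listed above concrete, here is a small Perl
sketch that counts how many matches each one finds in the same string. The
test string is an assumption chosen for illustration, and /(?!\R)\X/ is only
one possible reading of "\X unless \R". It assumes Perl 5.12 or later, where
\X matches an extended grapheme cluster.

```perl
use v5.12;      # also enables strict; \X means "extended grapheme cluster"
use warnings;

# Illustrative string (an assumption, not from the original post):
# "e" + COMBINING ACUTE ACCENT, then CR LF, then a VERTICAL TAB.
my $s = "e\x{301}\r\n\x{0B}";

my @dot1 = $s =~ /./g;          # 1: dot without /s, anything except \n
my @dot2 = $s =~ /./gs;         # 2: dot with /s, any single code point
my @dot3 = $s =~ /\V/g;         # 3: anything except vertical whitespace
my @dot4 = $s =~ /\X/g;         # 4: one extended grapheme cluster
my @dot5 = $s =~ /(?!\R)\X/g;   # 5: a cluster that is not a linebreak

printf "%-14s %d\n", $_->[0], scalar @{ $_->[1] }
    for [ 'no /s'        => \@dot1 ],
        [ '/s'           => \@dot2 ],
        [ '\V'           => \@dot3 ],
        [ '\X'           => \@dot4 ],
        [ '\X unless \R' => \@dot5 ];
```

The counts differ because the first three count code points while the last
two count grapheme clusters, and because each variant draws the line around
linebreak characters in a different place.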

Please see my next letter.

--tom
