Abigail <abigail@abigail.be> wrote on Thu, 07 Jul 2011 11:51:58 +0200:

> While Unicode is possible, almost all data I'm applying regexes to is
> ASCII data. I use /./ all the time, and for me, it just works. Where it
> doesn't, /(?s:.)/ does. /./ and /(?s:.)/ even works fine if I have mostly
> ASCII data with some Unicode characters or words thrown in.

> Full blown Unicode, which uses stuff where /./ or /(?s:.)/ won't work,
> I've yet to have the need to parse it.

Do you realize how "lucky" you are? And, perhaps, how unusual?

The data we work with in biomedical text mining is asymptotically close to
being 100% Unicode data. Think about the dozenish gigabytes of the PubMed
Open Access collection alone. That's all in UTF-8 XML. When we convert it
to "plain text" for mining, we *must* handle the α stuff correctly. Wrong
is not an option.

>> Here are 5 possible meanings for dot. I start with the original and *LEAST
>> USEFUL OF ALL POSSIBLE MEANINGS*, and progress to the most useful ones, the
>> ones that I think people should usually be using these days:
>>
>>     1 = no re /s (traditional and annoying)
>>     2 = use re /s (necessary but insufficient)
>>     3 = \V (improved #1)
>>     4 = \X (improved #2)
>>     5 = \X unless \R (improved #2, #3)
>>
>> See? How often do you guys write the *wrong* one of those?

> Never.

> Abigail

Abigail, you are not just "one of the guys". You are one of the only
people who understands all these differences. I would be sad if you had
written the wrong one. But I still bet most people do.

Please see my next letter.

--tom
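
For readers who want to see the five meanings of dot side by side, here is a minimal sketch. It is not code from this thread. The sample strings, the `count_matches` helper, and the spelling of #5 as `(?!\R)\X` are all assumptions made for illustration; "\X unless \R" could be written other ways. It assumes Perl 5.12 or later, where \X matches an extended grapheme cluster.

```perl
use v5.12;       # also enables strict; \X is an extended grapheme cluster here
use warnings;

# Hypothetical samples, chosen only for illustration:
my $grapheme = "e\x{301}";   # "e" + COMBINING ACUTE ACCENT: 2 code points, 1 grapheme
my $crlf     = "\r\n";       # a CRLF pair: 2 code points, 1 linebreak

# Hypothetical helper: count successive matches of $re in $str.
sub count_matches {
    my ($re, $str) = @_;
    my $n = () = $str =~ /$re/g;
    return $n;
}

# The five meanings from the quoted list, written as explicit patterns.
my %dot = (
    1 => qr/./,          # no /s: any code point except "\n"
    2 => qr/./s,         # with /s: any code point at all
    3 => qr/\V/,         # any code point that is not vertical whitespace
    4 => qr/\X/,         # one extended grapheme cluster
    5 => qr/(?!\R)\X/,   # assumed spelling: a grapheme cluster, unless a linebreak starts here
);

for my $n (sort keys %dot) {
    printf "%d: grapheme=%d crlf=%d\n",
        $n,
        count_matches($dot{$n}, $grapheme),
        count_matches($dot{$n}, $crlf);
}
```

On these two samples only #4 and #5 treat the accented "e" as a single unit, and only #3 and #5 refuse to match any part of the CRLF, which is the sense in which #5 improves on both #2 and #3.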