Why doesn't R's gsub (or regexp) for punctuation match all punctuation?

Question

I am working on cleaning up a text-based data file and cannot figure out why gsub("[[:punct:]]", "", X1) does not match all punctuation. Unfortunately, I cannot replicate the problem here, which makes me think it is a character encoding issue -- the punctuation in question looks obviously different from standard ASCII.

Is this a problem that I can solve after reading in the files, or do I have to do something at the front end? For example, Hadley's post on an encoding issue makes me think that I need to specify the encoding when I read the files. However, I am reading a bunch of different txt files from a folder, so I am not sure what the best solution is. Basically, I just want to retain all letters [A-Za-z] and exclude everything else. (That said, gsub("[^A-Za-z]", "", X1) doesn't work either!)

Any suggestions on handling this problem would be greatly appreciated!

you can define your own character classes with whatever punctuation you need or what it is not getting `gsub('[.,:]', '', '.,:?;')` would that work? — rawr, Mar 11 '15 at 00:42
I think that could work, but the solution doesn't scale. I have a bunch of different characters in the different format that need to be addressed. My hope is something that would essentially discard every single character that is not a letter. — Brian P, Mar 11 '15 at 00:45
then that would be `gsub('\\W', '', 'fasdfa.,:asdf?;adfa')`, correct? — rawr, Mar 11 '15 at 00:49
â€œâ€” This is what I am trying to figure out! The solutions aren't grabbing those characters ... — Brian P, Mar 11 '15 at 01:07
What is it about `gsub("[^A-Za-z]", "", X1)` that "doesn't work"? It seemed to "work" well when I tried it on your counter-example. — IRTFM, Mar 11 '15 at 01:14

Casimir et Hippolyte · Accepted Answer · 2015-03-11T02:13:54.893

The punctuation character is probably outside the ASCII range. By default, [[:punct:]] matches only ASCII punctuation characters. You can extend the class to Unicode with the (*UCP) directive, but that alone is not enough: you also need to tell the regex engine to read the target string as a UTF-encoded string with (*UTF); otherwise a multibyte character is seen as several one-byte characters. So:

gsub("(*UCP)(*UTF)[[:punct:]]", "", X1, perl=TRUE)

Note: these two directives exist only in perl mode and must be at the very beginning of the pattern.

Note 2: you can achieve the same result like this:

gsub("(*UTF)\\pP+", "", X1, perl=TRUE)

Because \pP is shorthand for all Unicode punctuation characters, (*UCP) is no longer needed.
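
As a rough illustration of the difference, the sketch below runs these patterns on a made-up string. The sample text and the variable names (`x`, `ascii_only`, `ucp_class`, `prop_class`, `letters_only`) are assumptions for demonstration, not the asker's data. The non-ASCII characters are written as `\u` escapes so that R marks the string as UTF-8.

```r
# Hypothetical sample: curly quotes (U+201C/U+201D), an em dash (U+2014),
# and a typographic apostrophe (U+2019) mixed with ASCII punctuation.
x <- "\u201cHello\u201d \u2014 world, it\u2019s fine!"

# In perl mode without (*UCP), [[:punct:]] covers ASCII punctuation only,
# so the comma and "!" are removed but the non-ASCII marks remain.
ascii_only <- gsub("[[:punct:]]", "", x, perl = TRUE)

# With both directives, the class extends to Unicode punctuation.
ucp_class <- gsub("(*UCP)(*UTF)[[:punct:]]", "", x, perl = TRUE)

# \pP matches Unicode punctuation directly, so (*UCP) is not required.
prop_class <- gsub("(*UTF)\\pP+", "", x, perl = TRUE)

# Keeping only ASCII letters, as the question ultimately asks for.
letters_only <- gsub("[^A-Za-z]", "", x)

print(ucp_class)
print(letters_only)
```

Keep in mind that \pP covers punctuation only; symbols such as currency signs fall under a separate Unicode category (\pS), so the letters-only pattern is the stricter filter of the two.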

Excellent explanation for a very useful capability! Thank you. — lawyeR, Mar 11 '15 at 10:09
