Let's assume I have the following sentence:
s = c("I don't want to remove punctuation for negations. Instead, I want to remove only general punctuation. For example, keep I wouldn't like it but remove Inter's fan or Man city's fan.")
I would like to have the following outcome:
"I don't want to remove punctuation for negations Instead I want to remove only general punctuation For example keep I wouldn't like it but remove Inter fan or Man city fan."
At the moment, if I simply use the code below, it removes both the 's and the ' in the negations:
s %>% str_replace_all("['']s\\b|[^[:alnum:][:blank:]@_]"," ")
"I don t want to remove punctuation for negations Instead I want to remove only general punctuation For example keep I wouldn t like it but remove Inter fan or Man city fan "
To sum up, I need code that removes general punctuation, including "'s", but keeps negations such as "don't" in their raw form.
Can anyone help me?
Thanks!
CodePudding user response:
You can use a negative lookahead, (?!t), to test that the [:punct:] character is not followed by a t.
gsub("[[:punct:]](?!t)\\w?", "", s, perl=TRUE)
#[1] "I don't want to remove punctuation for negations Instead I want to remove only general punctuation For example keep I wouldn't like it but remove Inter fan or Man city fan"
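Since gsub() is vectorised, a quick way to see what the lookahead does is to run the same pattern over a few short strings. This is only an illustrative sketch; the test strings are made up and do not come from the question.

```r
# Made-up test strings (not from the question).
x <- c("don't", "Inter's fan", "Stop, now.", "semi-transparent")

# Same pattern as above: a punctuation mark that is not followed by "t",
# plus at most one following word character (this is what removes the
# "s" of "'s").
out <- gsub("[[:punct:]](?!t)\\w?", "", x, perl = TRUE)
print(out)
```

The last string keeps its hyphen, because (?!t) protects any punctuation mark that comes directly before a t, not just the apostrophe in n't.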
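If you want to be stricter, you can additionally require an n before the punctuation by nesting a lookbehind, (?<=n), inside the negative lookahead, so that only the sequence n, punctuation, t is protected.
gsub("(?!(?<=n)[[:punct:]]t)[[:punct:]]\\w?", "", s, perl=TRUE)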
Or, to protect only 't (thanks to @chris-ruehlemann):
gsub("(?!'t)[[:punct:]]\\w?", "", s, perl=TRUE)
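One thing to be aware of with the patterns above is that the trailing \\w? removes whatever word character follows a removed punctuation mark, not only the s of 's (for example, "well,done" would lose its d). If that matters for your data, a possible variant is to match 's explicitly and remove other punctuation without touching the next character. The helper name and the pattern below are my own assumptions, not something taken from the answers, so treat this as a sketch.

```r
# Hypothetical helper (name and pattern are assumptions, not from the answers).
strip_punct_keep_nt <- function(x) {
  # Alternative 1: a possessive "'s" at the end of a word.
  # Alternative 2: any punctuation mark, unless it starts a "'t" that ends a word.
  gsub("'s\\b|(?!'t\\b)[[:punct:]]", "", x, perl = TRUE)
}

# Made-up example string.
x <- "semi-transparent (test), don't touch Inter's fan."
res <- strip_punct_keep_nt(x)
print(res)
```

Here the hyphen and the parentheses are removed without eating the t that follows them, while don't keeps its apostrophe.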
CodePudding user response:
We can do it in two steps: remove all punctuation except "'", then remove "'s" using a fixed match:
gsub("'s", "", gsub("[^[:alnum:][:space:]']", "", s), fixed = TRUE)
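To make the two steps easier to follow, here is the same idea unrolled with the intermediate result kept in its own variable. The example string is made up for illustration.

```r
x <- "Inter's fan, isn't it?"   # made-up example string

# Step 1: drop everything except letters, digits, whitespace and apostrophes.
step1 <- gsub("[^[:alnum:][:space:]']", "", x)

# Step 2: drop the literal two-character string "'s"
# (fixed = TRUE means a plain substring match, no regex).
step2 <- gsub("'s", "", step1, fixed = TRUE)

print(step1)
print(step2)
```

Because the second step is a plain substring match, it removes 's wherever it occurs, so a contraction such as it's would become it.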