How to remove punctuation excluding negations?

Time:09-30

Let's assume I have the following sentence:


s = c("I don't want to remove punctuation for negations. Instead, I want to remove only general punctuation. For example, keep I wouldn't like it but remove Inter's fan or Man city's fan.")

I would like to have the following outcome:

"I don't want to remove punctuation for negations Instead I want to remove only general punctuation For example keep I wouldn't like it but remove Inter fan or Man city fan."

At the moment, the code below removes both the 's and the apostrophe in negations:


  library(stringr)  # str_replace_all(); stringr also re-exports the %>% pipe
  s %>% str_replace_all("['']s\\b|[^[:alnum:][:blank:]@_]", " ")

 "I don t want to remove punctuation for negations  Instead  I want to remove only general punctuation           For example  keep I wouldn t like it but remove Inter  fan or Man city  fan "

To sum up, I need code that removes general punctuation, including "'s", but keeps negations in their raw form.

Can anyone help me?

Thanks!

CodePudding user response:

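You can use a negative lookahead, (?!t), to check that the [:punct:] character is not followed by a t. The trailing \\w? also removes one word character after the punctuation, which is what drops the s in 's.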

gsub("[[:punct:]](?!t)\\w?", "", s, perl=TRUE)
#[1] "I don't want to remove punctuation for negations Instead I want to remove only general punctuation For example keep I wouldn't like it but remove Inter fan or Man city fan"

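In case you want to be stricter and protect only n't, you can additionally test for a preceding n. Note that a plain negative lookbehind (?<!n) in front of the character class would skip every punctuation mark that follows an n (for example the period after "punctuation"), so the lookbehind has to sit inside the lookahead:

gsub("(?!(?<=n)'t)[[:punct:]]\\w?", "", s, perl=TRUE)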

Or, to restrict the exception to 't only (thanks to @chris-ruehlemann):

gsub("(?!'t)[[:punct:]]\\w?", "", s, perl=TRUE)
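A minimal sketch of how the 't variant behaves on a shorter input. The helper name strip_punct and the test strings are made up for illustration and are not part of the answer:

```r
# Hypothetical helper wrapping the "(?!'t)" pattern from the answer above.
# It removes any punctuation character, plus at most one following word
# character (e.g. the s in 's), unless the punctuation is an apostrophe
# directly followed by t.
strip_punct <- function(x) {
  gsub("(?!'t)[[:punct:]]\\w?", "", x, perl = TRUE)
}

s2 <- "I don't want it. Keep wouldn't, but drop Inter's fan!"
strip_punct(s2)
#[1] "I don't want it Keep wouldn't but drop Inter fan"

# Side effect of \\w?: other contractions lose their suffix too.
strip_punct("I'm sure it's fine")
#[1] "I sure it fine"
```

The second call shows a limitation worth knowing: because \\w? consumes one word character after any removed punctuation, contractions such as I'm or it's are shortened as well, which may or may not be what you want.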

CodePudding user response:

We can do it in two steps: remove all punctuation except "'", then remove "'s" using a fixed match:

gsub("'s", "", gsub("[^[:alnum:][:space:]']", "", s), fixed = TRUE)
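The same two steps can be written out one at a time, which makes it easier to see what each call does. This is only a sketch in base R; the function name clean_text and the sample strings are assumptions for illustration:

```r
# Hypothetical helper: the two-step approach from the answer above, unrolled.
clean_text <- function(x) {
  # Step 1: drop everything that is not a letter, digit, whitespace
  # character, or apostrophe.
  no_punct <- gsub("[^[:alnum:][:space:]']", "", x)
  # Step 2: drop the literal string 's (fixed = TRUE means no regex).
  gsub("'s", "", no_punct, fixed = TRUE)
}

clean_text("Inter's fan, don't.")
#[1] "Inter fan don't"

# Apostrophes that are not part of 's are kept, and it's becomes it.
clean_text("The fans' view: it's Inter's year, isn't it?")
#[1] "The fans' view it Inter year isn't it"
```

Unlike the lookahead patterns, this version keeps every apostrophe that is not followed by s, so forms such as fans' or I'm survive unchanged, while it's is still reduced to it.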