How to remove duplicate strings that have additional words and symbols?

Is there a way to remove near-duplicate items from a comma-separated string? The methods I've found only seem to remove exact duplicates.

For example, given the following comma-separated string

words <- c("Hello, Hello, At desk (Idle), At desk (Idle)†, On floor (Active), On floor (Active)†, In meeting (Advisors), In meeting (Advisors)†, Day off (Birthday), Day off (Birthday)†")

and the desired result is

"Hello, At desk (Idle), On floor (Active), In meeting (Advisors), Day off (Birthday)"

What's been tried is

new.words <- strsplit(words, ", ")
sapply(new.words, function(x) rle(x)$values)

which only removes the exact duplicated words and returns

"Hello, At desk (Idle), At desk (Idle)†, On floor (Active), On floor (Active)†, In meeting (Advisors), In meeting (Advisors)†, Day off (Birthday), Day off (Birthday)†"

only removing the duplicated Hello.

Thanks!

CodePudding user response：

Not 100% what you want, since the commas will disappear, but maybe it will help you

library(stringr)

words <- c("Hello, Hello, At desk, At desk (Idle), On floor, On floor (Active), In meeting, In meeting *, Day off, Day off †")

words %>%
  str_split(pattern = " ",simplify = TRUE) %>%
  str_split(pattern = ",",simplify = TRUE) %>% 
  as.vector() %>%
  unique() %>%
  str_c(collapse = " ")

[1] "Hello At desk (Idle) On floor (Active) In meeting * Day off † "
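If the commas need to stay, one option is to split on the comma separator instead of on spaces, strip the trailing marker from each item, and deduplicate whole items. This is only a minimal sketch using the string from the question. It assumes the near-duplicates differ only by a trailing † or * (my assumption, not something the question states), and the names `items` and `result` are just for illustration.

```r
library(stringr)

words <- "Hello, Hello, At desk (Idle), At desk (Idle)†, On floor (Active), On floor (Active)†, In meeting (Advisors), In meeting (Advisors)†, Day off (Birthday), Day off (Birthday)†"

# One element per comma-separated item
items <- str_split(words, pattern = ",\\s*")[[1]]

# Drop trailing marker symbols (assumed here to be only † or *)
items <- str_remove(items, "\\s*[†*]+$")

# Keep the first occurrence of each item and rebuild the string
result <- str_c(unique(items), collapse = ", ")
result
# [1] "Hello, At desk (Idle), On floor (Active), In meeting (Advisors), Day off (Birthday)"
```

Because `unique()` works on whole items, this also catches duplicates that are not next to each other.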

CodePudding user response：

Based on the updated data and expected output

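# inner gsub: collapse each item followed by repeats of itself into one copy
# outer gsub: drop the leftover † marker
gsub("†", "",
    gsub("([^,]+)(?:,\\s*\\1)*", "\\1", words, perl = TRUE))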

-output

[1] "Hello, At desk (Idle), On floor (Active), In meeting (Advisors), Day off (Birthday)"
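To see what a backreference-based pattern like this does, it may help to run the two substitutions one at a time on a shortened input. This is only a sketch: the shortened `words` string and the names `step1` and `step2` are illustrative, and it assumes the marker is always †.

```r
words <- "Hello, Hello, At desk (Idle), At desk (Idle)†"

# Step 1: ([^,]+) captures one item, and (?:,\\s*\\1)* consumes any
# immediately following repeats of that same item, leaving one copy.
step1 <- gsub("([^,]+)(?:,\\s*\\1)*", "\\1", words, perl = TRUE)
step1
# [1] "Hello, At desk (Idle)†"

# Step 2: remove the leftover marker as a literal string
step2 <- gsub("†", "", step1, fixed = TRUE)
step2
# [1] "Hello, At desk (Idle)"
```

Note that the backreference only collapses repeats that sit directly next to each other, which is the case for the data in the question.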