How can I delete all duplicate words, along with the comma and whitespace that follow them, using regex in R?
So far I have come up with the following regular expression, which matches the duplicates but not the comma and whitespace:
(\b\w+\b)(?=[\S\s]*\b\1\b)
An example list would be:
blue, red, blue, yellow, green, blue
The output should look like:
blue, red, yellow, green
So it would have to match two of the "blue" entries in this case, as well as the following comma and whitespace (if there are any).
CodePudding user response:
It depends on whether your list is truly a list/vector or a single string with commas in it.
# your data is actually already a list/vector
v <- c("blue", "red", "blue", "yellow", "green", "blue")
unique(v)
[1] "blue" "red" "yellow" "green"
# if your data is actually a comma-separated string
s <- "blue, red, blue, yellow, green, blue"
# if output needs to be a vector
unique(strsplit(s, ", ")[[1]])
[1] "blue" "red" "yellow" "green"
# if output needs to be a string again
paste(unique(strsplit(s, ", ")[[1]]), collapse = ", ")
[1] "blue, red, yellow, green"
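If the spacing around the commas is not guaranteed to be exactly one space, splitting on ", " can leave stray whitespace behind and miss duplicates. Here is a minimal sketch for that case. Note that dedupe_csv is a hypothetical helper name made up for illustration, not a base R function, and it assumes the items themselves never contain commas.

```r
# Hypothetical helper: split on commas, trim whitespace, drop repeats,
# and glue the items back together. Works on a character vector.
dedupe_csv <- function(s, sep = ", ") {
  parts <- strsplit(s, ",", fixed = TRUE)
  vapply(
    parts,
    function(p) paste(unique(trimws(p)), collapse = sep),
    character(1),
    USE.NAMES = FALSE
  )
}

dedupe_csv("blue, red,blue ,  yellow, green, blue")
# [1] "blue, red, yellow, green"
```

Because it goes through vapply(), the same call should also work on a whole column of such strings.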
Example based on the list column in a data.table or data.frame
# load data.table first, otherwise data.table() is not available
library(data.table)
dt <- data.table(
  id = 1:5,
  colors = list(
    c("blue", "red", "blue", "yellow", "green", "blue"),
    c("blue", "blue", "yellow", "green", "blue"),
    c("blue", "red", "blue", "yellow"),
    c("red", "red", "yellow", "yellow", "green", "blue"),
    c("black")
  )
)
## using data.table
# setDT(dt) is only needed if dt starts out as a plain data.frame
# use colors instead of clean_list to just fix the existing column
dt[, clean_list := lapply(colors, unique)]
## using dplyr
library(dplyr)
# use colors instead of clean_list to just fix the existing column
# mutate() returns a new object, so assign the result back
dt <- dt %>% mutate(clean_list = lapply(colors, unique))
dt
#    id                           colors            clean_list
# 1:  1  blue,red,blue,yellow,green,blue blue,red,yellow,green
# 2:  2      blue,blue,yellow,green,blue     blue,yellow,green
# 3:  3             blue,red,blue,yellow       blue,red,yellow
# 4:  4 red,red,yellow,yellow,green,blue red,yellow,green,blue
# 5:  5                            black                 black
# or just simply in base
dt$colors <- lapply(dt$colors, function(x) unique(x))
CodePudding user response:
We could use paste() with unique() and the collapse argument:
paste(unique(string), collapse = ", ")
[1] "blue, red, yellow, green"
data:
string <- c("blue", "red", "blue", "yellow", "green", "blue")
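Since the question asked for a regex specifically, here is a rough sketch of how that could look on the comma-separated string. It assumes every item is a single word made of \w characters, and dedupe_regex is a hypothetical name used only for illustration. PCRE has no variable-length lookbehind, so rather than one pattern this repeatedly removes a later repeat of a word (with the comma and whitespace in front of it) until nothing changes, which keeps the first occurrence.

```r
# Hypothetical helper: drop later repeats of a word, keep the first one.
dedupe_regex <- function(s) {
  # group 1: a word; group 2: whatever sits between it and its next repeat
  pattern <- "\\b(\\w+)\\b(.*?),\\s*\\1\\b"
  repeat {
    out <- gsub(pattern, "\\1\\2", s, perl = TRUE)
    if (identical(out, s)) return(out)  # no more repeats found
    s <- out
  }
}

dedupe_regex("blue, red, blue, yellow, green, blue")
# [1] "blue, red, yellow, green"
```

For anything beyond simple single-word items, the unique() approaches above are likely the safer choice.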