I have a data frame with a column that contains a string with multiple names separated by commas:
df = data.frame(my.text = c("John Smith, Johnny Smith, John Smith", "John Doe, Doe, Johnny", "Jane Doe, Jane Doe"))
df
my.text
1 John Smith, Johnny Smith, John Smith
2 John Doe, Doe, Johnny
3 Jane Doe, Jane Doe
I'd like to eliminate the duplicate names within each row (i.e. get the unique names) and store the result back in my.text so it looks like this:
df
my.text
1 John Smith, Johnny Smith
2 John Doe, Doe, Johnny
3 Jane Doe
This code achieves this for a single string/row:
df$my.text[1] = paste(unique(unlist(strsplit(df$my.text[1], split = ", "))), collapse = ", ")
But how do I apply this to the entire my.text column? I have tried mapply but cannot figure out how to pass it so many functions at once. Or perhaps there's a better way I'm overlooking?
CodePudding user response:
strsplit is already vectorized, so it returns one character vector per row. To reduce each of those to a single string again, we can use sapply and paste:
sapply(strsplit(df$my.text, ",\\s*"), function(z) paste(unique(z), collapse = ", "))
# [1] "John Smith, Johnny Smith" "John Doe, Doe, Johnny" "Jane Doe"
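To store the result back in the column, as the question asks, the same expression can be assigned to df$my.text. This is a minimal sketch, assuming the example data from the question. Here vapply is swapped in for sapply only to guarantee a character vector result, and stringsAsFactors = FALSE is an added assumption so that my.text stays character on R versions before 4.0:

```r
# Rebuild the example data from the question.
# stringsAsFactors = FALSE is an assumption: it keeps my.text as character
# on older R versions, where strsplit would fail on a factor column.
df <- data.frame(
  my.text = c("John Smith, Johnny Smith, John Smith",
              "John Doe, Doe, Johnny",
              "Jane Doe, Jane Doe"),
  stringsAsFactors = FALSE
)

# Split each row on a comma plus optional whitespace, drop repeated names,
# and paste the remaining names back into one string per row.
df$my.text <- vapply(
  strsplit(df$my.text, ",\\s*"),
  function(z) paste(unique(z), collapse = ", "),
  character(1)  # each row must yield exactly one string
)

df
```

Note that unique compares the names exactly, so "John Smith" and "john smith" would both be kept.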