Using unlist on a column of strings within a data frame

Time:08-09

I have a data frame with a column that contains a string with multiple names separated by commas:

df = data.frame(my.text = c("John Smith, Johnny Smith, John Smith", "John Doe, Doe, Johnny", "Jane Doe, Jane Doe"))

df
                               my.text
1 John Smith, Johnny Smith, John Smith
2                John Doe, Doe, Johnny
3                   Jane Doe, Jane Doe

I'd like to eliminate the duplicate names within each row (i.e. keep only the unique names) and store the result back in my.text, so that it looks like this:

df
                               my.text
1             John Smith, Johnny Smith
2                John Doe, Doe, Johnny
3                             Jane Doe

This code achieves this for a single string/row:

df$my.text[1] = paste(unique(unlist(strsplit(df$my.text[1], split = ", "))), collapse = ", ")

But how do I apply this to the entire my.text column? I have tried mapply but cannot figure out how to pass it several functions at once. Or perhaps there's a better way that I'm overlooking?

CodePudding user response:

strsplit is already vectorized; to collapse each result back into a single string, we can use sapply and paste:

sapply(strsplit(df$my.text, ",\\s*"), function(z) paste(unique(z), collapse = ", "))
# [1] "John Smith, Johnny Smith" "John Doe, Doe, Johnny"    "Jane Doe"                
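
To store the result back in the column, as the question asks, the same expression can be assigned to df$my.text. The following is a minimal sketch rather than the only way to do it. It assumes my.text is a character column (hence the explicit stringsAsFactors = FALSE, since strsplit does not accept a factor), and it swaps sapply for vapply so that the return type is checked:

```r
# Rebuild the example data. stringsAsFactors = FALSE keeps my.text as
# character even on R versions before 4.0, where the default was TRUE.
df <- data.frame(
  my.text = c("John Smith, Johnny Smith, John Smith",
              "John Doe, Doe, Johnny",
              "Jane Doe, Jane Doe"),
  stringsAsFactors = FALSE
)

# Split each row on a comma followed by optional whitespace, keep the first
# occurrence of each name, and paste the pieces back into one string.
# vapply() checks that every row yields exactly one character value.
df$my.text <- vapply(
  strsplit(df$my.text, ",\\s*"),
  function(z) paste(unique(z), collapse = ", "),
  character(1)
)

df
```

Because unique keeps the first occurrence of each value, the names in each row stay in their original order.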