Merge/collapse values from one column without duplicates, keeping ids of another column in R-CodePudding

Merge/collapse values from one column without duplicates keeping id of another column in R

Hello, I am trying to summarize a large dataframe in R, which I simplified here below:

site    id  sp
s1  A   sp1,sp2,sp3,sp6
s1  A   sp1,sp2,sp3,sp4,sp5
s1  A   sp3,sp5,sp7
s2  B   sp2,sp4
s2  B   sp5,sp7
s3  C   sp2,sp5
s3  C   sp1,sp4
s3  C   sp3,sp6

I would like to have a single record per site collapsing the non-duplicate comma-separated values of the sp column. The expected result is:

site    id  sp
s1  A   sp1,sp2,sp3,sp4,sp5,sp6,sp7
s2  B   sp2,sp4,sp5,sp7
s3  C   sp1,sp2,sp3,sp4,sp5,sp6

What would be the best (and fastest) way to do it. Thank you very much in advance

I tried using group_by but I'm not sure how to join the data from sp and then return only the unique values.

CodePudding user response：

We could split the sp column with separate_rows, then get the distinct of the data and do a group by 'site', 'id' and paste the 'sp' values

library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
   separate_rows(sp) %>%
   distinct %>%
   group_by(site, id) %>%
   summarise(sp = str_c(sort(sp), collapse = ","), .groups = 'drop')

-output

# A tibble: 3 × 3
  site  id    sp                         
  <chr> <chr> <chr>                      
1 s1    A     sp1,sp2,sp3,sp4,sp5,sp6,sp7
2 s2    B     sp2,sp4,sp5,sp7            
3 s3    C     sp1,sp2,sp3,sp4,sp5,sp6

data

df1 <- structure(list(site = c("s1", "s1", "s1", "s2", "s2", "s3", "s3", 
"s3"), id = c("A", "A", "A", "B", "B", "C", "C", "C"), sp = c("sp1,sp2,sp3,sp6", 
"sp1,sp2,sp3,sp4,sp5", "sp3,sp5,sp7", "sp2,sp4", "sp5,sp7", "sp2,sp5", 
"sp1,sp4", "sp3,sp6")), class = "data.frame",
 row.names = c(NA, 
-8L))