Merge/collapse values from one column without duplicates keeping id of another column in R
Hello, I am trying to summarize a large dataframe in R, which I simplified here below:
site id sp
s1 A sp1,sp2,sp3,sp6
s1 A sp1,sp2,sp3,sp4,sp5
s1 A sp3,sp5,sp7
s2 B sp2,sp4
s2 B sp5,sp7
s3 C sp2,sp5
s3 C sp1,sp4
s3 C sp3,sp6
I would like to have a single record per site
collapsing the non-duplicate comma-separated values of the sp
column. The expected result is:
site id sp
s1 A sp1,sp2,sp3,sp4,sp5,sp6,sp7
s2 B sp2,sp4,sp5,sp7
s3 C sp1,sp2,sp3,sp4,sp5,sp6
What would be the best (and fastest) way to do it. Thank you very much in advance
I tried using group_by
but I'm not sure how to join the data from sp
and then return only the unique values.
CodePudding user response:
We could split the sp
column with separate_rows
, then get the distinct
of the data and do a group by 'site', 'id' and paste
the 'sp' values
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
separate_rows(sp) %>%
distinct %>%
group_by(site, id) %>%
summarise(sp = str_c(sort(sp), collapse = ","), .groups = 'drop')
-output
# A tibble: 3 × 3
site id sp
<chr> <chr> <chr>
1 s1 A sp1,sp2,sp3,sp4,sp5,sp6,sp7
2 s2 B sp2,sp4,sp5,sp7
3 s3 C sp1,sp2,sp3,sp4,sp5,sp6
data
df1 <- structure(list(site = c("s1", "s1", "s1", "s2", "s2", "s3", "s3",
"s3"), id = c("A", "A", "A", "B", "B", "C", "C", "C"), sp = c("sp1,sp2,sp3,sp6",
"sp1,sp2,sp3,sp4,sp5", "sp3,sp5,sp7", "sp2,sp4", "sp5,sp7", "sp2,sp5",
"sp1,sp4", "sp3,sp6")), class = "data.frame",
row.names = c(NA,
-8L))