I have a table which looks like:
Col A | Col b |
---|---|
a | 2,3,4,5 |
a | 3,5,6,7,8 |
b | 1,2,4 |
b | 3,5,7 |
I want to aggregate this table by Col A. The output should look like the following:
Col A | Col b |
---|---|
a | 2,3,4,5,6,7,8 |
b | 1,2,3,4,5,7 |
Please guide me on how I get the desirable output in R?
CodePudding user response:
Assuming the Col.b
is a string column as in
quux <- structure(list(Col.A = c("a", "a", "b", "b"), Col.b = c("2,3,4,5", "3,5,6,7,8", "1,2,4", "3,5,7")), class = "data.frame", row.names = c(NA, -4L))
then we can do
library(dplyr)
quux %>%
group_by(Col.A) %>%
summarize(Col.b = paste(unique(unlist(strsplit(Col.b, ","))), collapse = ","))
# # A tibble: 2 × 2
# Col.A Col.b
# <chr> <chr>
# 1 a 2,3,4,5,6,7,8
# 2 b 1,2,4,3,5,7
If it is instead a list-column, as in
quux <- structure(list(Col.A = c("a", "a", "b", "b"), Col.b = list(c("2", "3", "4", "5"), c("3", "5", "6", "7", "8"), c("1", "2", "4"), c("3", "5", "7"))), row.names = c(NA, -4L), class = "data.frame")
quux
# Col.A Col.b
# 1 a 2, 3, 4, 5
# 2 a 3, 5, 6, 7, 8
# 3 b 1, 2, 4
# 4 b 3, 5, 7
then we can do
library(dplyr)
quux %>%
group_by(Col.A) %>%
summarize(Col.b = list(unique(unlist(Col.b)))) %>%
as.data.frame()
# Col.A Col.b
# 1 a 2, 3, 4, 5, 6, 7, 8
# 2 b 1, 2, 4, 3, 5, 7
The trailing %>% as.data.frame()
is not at all required, it is provided solely to demonstrate what Col.b
now contains. Without it, it looks like this, which is value-wise equivalent:
quux %>%
group_by(Col.A) %>%
summarize(Col.b = list(unique(unlist(Col.b))))
# # A tibble: 2 × 2
# Col.A Col.b
# <chr> <list>
# 1 a <chr [7]>
# 2 b <chr [6]>
CodePudding user response:
One way is to use separate_rows
to split and aggregate back toString
on the unique sorted values, i.e.
library(dplyr)
library(tidyr)
df %>%
separate_rows(Colb, sep = ',') %>%
group_by(ColA) %>%
summarise(Colb = toString(sort(unique(Colb))))
# A tibble: 2 × 2
ColA Colb
<chr> <chr>
1 a 2, 3, 4, 5, 6, 7, 8
2 b 1, 2, 3, 4, 5, 7
DATA
dput(df)
structure(list(ColA = c("a", "a", "b", "b"), Colb = c("2,3,4,5",
"3,5,6,7,8", "1,2,4", "3,5,7")), class = "data.frame", row.names = c(NA,
-4L))
CodePudding user response:
Using base, split on ColA, then split strings by comma, then get unique values, finally paste it back and convert it to dataframe:
stack(lapply(split(df$Colb, df$ColA), function(i){
paste(unique(unlist(strsplit(i, ","))), collapse = ",")
}))
# values ind
# 1 2,3,4,5,6,7,8 a
# 2 1,2,4,3,5,7 b
CodePudding user response:
> dat <- data.table::data.table(a = c('a', 'a', 'b', 'b'), b = c('2,3,4,5', '3,5,6,7,8', '1,2,4', '3,5,7'))
> dat <- dat[, .(bc = paste0(b, collapse = ',')), by = .(a)]
> dat$bs <- mapply(paste0, mapply(unique, (mapply(strsplit, dat$bc, split = ','))), collapse = ',')
> dat
a bc bs
1: a 2,3,4,5,3,5,6,7,8 2,3,4,5,6,7,8
2: b 1,2,4,3,5,7 1,2,4,3,5,7
> dat$bs
[1] "2,3,4,5,6,7,8" "1,2,4,3,5,7"