Cumulative sum equivalent for string-CodePudding

I have a table which looks like:

Col A	Col b
a	2,3,4,5
a	3,5,6,7,8
b	1,2,4
b	3,5,7

I want to aggregate this table by Col A. The output should look like the following:

Col A	Col b
a	2,3,4,5,6,7,8
b	1,2,3,4,5,7

Please guide me on how I get the desirable output in R?

CodePudding user response：

Assuming the Col.b is a string column as in

quux <- structure(list(Col.A = c("a", "a", "b", "b"), Col.b = c("2,3,4,5", "3,5,6,7,8", "1,2,4", "3,5,7")), class = "data.frame", row.names = c(NA, -4L))

then we can do

library(dplyr)
quux %>%
  group_by(Col.A) %>%
  summarize(Col.b = paste(unique(unlist(strsplit(Col.b, ","))), collapse = ","))
# # A tibble: 2 × 2
#   Col.A Col.b        
#   <chr> <chr>        
# 1 a     2,3,4,5,6,7,8
# 2 b     1,2,4,3,5,7

If it is instead a list-column, as in

quux <- structure(list(Col.A = c("a", "a", "b", "b"), Col.b = list(c("2", "3", "4", "5"), c("3", "5", "6", "7", "8"), c("1", "2", "4"), c("3", "5", "7"))), row.names = c(NA, -4L), class = "data.frame")
quux
#   Col.A         Col.b
# 1     a    2, 3, 4, 5
# 2     a 3, 5, 6, 7, 8
# 3     b       1, 2, 4
# 4     b       3, 5, 7

then we can do

library(dplyr)
quux %>%
  group_by(Col.A) %>%
  summarize(Col.b = list(unique(unlist(Col.b)))) %>%
  as.data.frame()
#   Col.A               Col.b
# 1     a 2, 3, 4, 5, 6, 7, 8
# 2     b    1, 2, 4, 3, 5, 7

The trailing %>% as.data.frame() is not at all required, it is provided solely to demonstrate what Col.b now contains. Without it, it looks like this, which is value-wise equivalent:

quux %>%
  group_by(Col.A) %>%
  summarize(Col.b = list(unique(unlist(Col.b))))
# # A tibble: 2 × 2
#   Col.A Col.b    
#   <chr> <list>   
# 1 a     <chr [7]>
# 2 b     <chr [6]>

CodePudding user response：

One way is to use separate_rows to split and aggregate back toString on the unique sorted values, i.e.

library(dplyr)
library(tidyr)

df %>% 
 separate_rows(Colb, sep = ',') %>% 
 group_by(ColA) %>% 
 summarise(Colb = toString(sort(unique(Colb))))

# A tibble: 2 × 2
  ColA  Colb               
  <chr> <chr>              
1 a     2, 3, 4, 5, 6, 7, 8
2 b     1, 2, 3, 4, 5, 7

DATA

dput(df)
structure(list(ColA = c("a", "a", "b", "b"), Colb = c("2,3,4,5", 
"3,5,6,7,8", "1,2,4", "3,5,7")), class = "data.frame", row.names = c(NA, 
-4L))

CodePudding user response：

Using base, split on ColA, then split strings by comma, then get unique values, finally paste it back and convert it to dataframe:

stack(lapply(split(df$Colb, df$ColA), function(i){
  paste(unique(unlist(strsplit(i, ","))), collapse = ",")
  }))
#          values ind
# 1 2,3,4,5,6,7,8   a
# 2   1,2,4,3,5,7   b

CodePudding user response：

> dat <- data.table::data.table(a = c('a', 'a', 'b', 'b'), b = c('2,3,4,5', '3,5,6,7,8', '1,2,4', '3,5,7'))
> dat <- dat[, .(bc = paste0(b, collapse = ',')), by = .(a)]

> dat$bs <- mapply(paste0, mapply(unique, (mapply(strsplit, dat$bc, split = ','))), collapse = ',')
> dat      
   a                bc            bs
1: a 2,3,4,5,3,5,6,7,8 2,3,4,5,6,7,8                                                                       
2: b       1,2,4,3,5,7   1,2,4,3,5,7                        
> dat$bs                                                                                           
[1] "2,3,4,5,6,7,8" "1,2,4,3,5,7"