Aggregate sum of column values for all pairwise groupings of other columns in a dataframe in R-CodePudding

I have been trying to aggregate the sum of one column in a dataframe for all pairwise comparisons of the other columns in a data frame. The dataset I have is very large, but below is a dummy set that illustrates the issue I am having. I want to be able to do this so I can not repeat a ton of code to individually get these pairwise summations.

library(tidyverse)
library(broom)

data <- data.frame(team= c('A','B','C','A','B', 'A'),
       height= c('tall', 'short', 'tall','short','tall','tall'),
       size= c('big','small','big','big','small','small'),
       cost= c(5,5,4,4,5,4))

lapply(1:ncol(data), function(i) aggregate(data$cost~., data[c(1, i)], sum)) 

#This gives the results below grouping just first column (team) against
#the others and getting a sum :

[[1]]
  team team.1 data$cost
1     A       A        13
2     B       B        10
3     C       C         4

[[2]]
  team height data$cost
1     A  short         4
2     B  short         5
3     A   tall         9
4     B   tall         5
5     C   tall         4

[[3]]
  team  size data$cost
1     A   big         9
2     C   big         4
3     A small         4
4     B small        10

[[4]]
  team data$cost
1     A        13
2     B        10
3     C         4

What I want to avoid is having to manually replace the column number in the aggregate function, indicated by data[c(1, i)] to get the next set of pairwise groupings. Again the actual data frame is much much larger and this would be tedious.

I tried the following code and attempted to create a list of lists that I could unnest:

all_comparisons <- lapply(1:ncol(data), function(i) aggregate(data$cost~., 
                                                       data[c(c(1:i), i)], sum))

huge_list_all_comparisons <- all_comparisons %>% bind_rows(all_comparisons) %>%    # make larger sample data
  mutate_if(is.list, simplify_all) %>%    # flatten each list element internally 
  unnest()  

>huge_list_all_comparisons
 A tibble: 40 × 8
   team team.1 `data$cost` height height.1 size  size.1 cost.1
   <chr> <chr>         <dbl> <chr>  <chr>    <chr> <chr>   <dbl>
 1 A     A                13 NA     NA       NA    NA         NA
 2 B     B                10 NA     NA       NA    NA         NA
 3 C     C                 4 NA     NA       NA    NA         NA
 4 A     NA                4 short  short    NA    NA         NA
 5 B     NA                5 short  short    NA    NA         NA
 6 A     NA                9 tall   tall     NA    NA         NA
 7 B     NA                5 tall   tall     NA    NA         NA
 8 C     NA                4 tall   tall     NA    NA         NA
 9 A     NA                4 short  NA       big   big        NA
10 A     NA                5 tall   NA       big   big        NA
# … with 30 more rows

which returns cost sums of every single possible grouping, not just pairwise (in the real data set this would be prohibitive and lead to over a million rows of comparisons)

I would greatly appreciate any help getting some code that I could use to accomplish this pairwise group aggregation across the dataframe

CodePudding user response：

You can use combn() to get the possible combinations of indices and then lapply() over that.

library(tidyverse)

data |> 
  seq_along() |> 
  combn(2, simplify = F) |> 
  lapply(\(i) aggregate(data$cost~., data[c(i[1], i[2])], sum)) 
#> [[1]]
#>   team height data$cost
#> 1    A  short         4
#> 2    B  short         5
#> 3    A   tall         9
#> 4    B   tall         5
#> 5    C   tall         4
#> 
#> [[2]]
#>   team  size data$cost
#> 1    A   big         9
#> 2    C   big         4
#> 3    A small         4
#> 4    B small        10
#> 
#> [[3]]
#>   team data$cost
#> 1    A        13
#> 2    B        10
#> 3    C         4
#> 
#> [[4]]
#>   height  size data$cost
#> 1  short   big         4
#> 2   tall   big         9
#> 3  short small         5
#> 4   tall small         9
#> 
#> [[5]]
#>   height data$cost
#> 1  short         9
#> 2   tall        18
#> 
#> [[6]]
#>    size data$cost
#> 1   big        13
#> 2 small        14

^{Created on 2022-03-30 by the reprex package (v2.0.1)}