I have been trying to aggregate the sum of one column in a dataframe for all pairwise comparisons of the other columns in a data frame. The dataset I have is very large, but below is a dummy set that illustrates the issue I am having. I want to be able to do this so I can not repeat a ton of code to individually get these pairwise summations.
library(tidyverse)
library(broom)
data <- data.frame(team= c('A','B','C','A','B', 'A'),
height= c('tall', 'short', 'tall','short','tall','tall'),
size= c('big','small','big','big','small','small'),
cost= c(5,5,4,4,5,4))
lapply(1:ncol(data), function(i) aggregate(data$cost~., data[c(1, i)], sum))
#This gives the results below grouping just first column (team) against
#the others and getting a sum :
[[1]]
team team.1 data$cost
1 A A 13
2 B B 10
3 C C 4
[[2]]
team height data$cost
1 A short 4
2 B short 5
3 A tall 9
4 B tall 5
5 C tall 4
[[3]]
team size data$cost
1 A big 9
2 C big 4
3 A small 4
4 B small 10
[[4]]
team data$cost
1 A 13
2 B 10
3 C 4
What I want to avoid is having to manually replace the column number in the aggregate function, indicated by data[c(1, i)]
to get the next set of pairwise groupings. Again the actual data frame is much much larger and this would be tedious.
I tried the following code and attempted to create a list of lists that I could unnest:
all_comparisons <- lapply(1:ncol(data), function(i) aggregate(data$cost~.,
data[c(c(1:i), i)], sum))
huge_list_all_comparisons <- all_comparisons %>% bind_rows(all_comparisons) %>% # make larger sample data
mutate_if(is.list, simplify_all) %>% # flatten each list element internally
unnest()
>huge_list_all_comparisons
A tibble: 40 × 8
team team.1 `data$cost` height height.1 size size.1 cost.1
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 A A 13 NA NA NA NA NA
2 B B 10 NA NA NA NA NA
3 C C 4 NA NA NA NA NA
4 A NA 4 short short NA NA NA
5 B NA 5 short short NA NA NA
6 A NA 9 tall tall NA NA NA
7 B NA 5 tall tall NA NA NA
8 C NA 4 tall tall NA NA NA
9 A NA 4 short NA big big NA
10 A NA 5 tall NA big big NA
# … with 30 more rows
which returns cost sums of every single possible grouping, not just pairwise (in the real data set this would be prohibitive and lead to over a million rows of comparisons)
I would greatly appreciate any help getting some code that I could use to accomplish this pairwise group aggregation across the dataframe
CodePudding user response:
You can use combn()
to get the possible combinations of indices and then lapply()
over that.
library(tidyverse)
data |>
seq_along() |>
combn(2, simplify = F) |>
lapply(\(i) aggregate(data$cost~., data[c(i[1], i[2])], sum))
#> [[1]]
#> team height data$cost
#> 1 A short 4
#> 2 B short 5
#> 3 A tall 9
#> 4 B tall 5
#> 5 C tall 4
#>
#> [[2]]
#> team size data$cost
#> 1 A big 9
#> 2 C big 4
#> 3 A small 4
#> 4 B small 10
#>
#> [[3]]
#> team data$cost
#> 1 A 13
#> 2 B 10
#> 3 C 4
#>
#> [[4]]
#> height size data$cost
#> 1 short big 4
#> 2 tall big 9
#> 3 short small 5
#> 4 tall small 9
#>
#> [[5]]
#> height data$cost
#> 1 short 9
#> 2 tall 18
#>
#> [[6]]
#> size data$cost
#> 1 big 13
#> 2 small 14
Created on 2022-03-30 by the reprex package (v2.0.1)