Aggregate including all levels?-CodePudding

I have an R data frame with factor and numeric columns, and I would like to calculate certain summaries of the numeric columns by various groupings of the factors. In particular, I would like to be able to summarise by multiple combinations of the factors at once, and return the results in a single object.

For example, say I'm using the warpbreaks data frame, which has columns breaks (integer), wool (factor with levels "A", "B") and tension (factor with levels "L", "M" and "H"). If I want to get the average number of breaks for each combination of wool and tension, I know I can use aggregate(breaks ~ wool tension, data = warpbreaks, mean) and it will give me something like:

  wool tension   breaks
1    A       L 44.55556
2    B       L 28.22222
...
6    B       H 18.77778

But I'd like to also calculate the means across just wool, and just tension, and for the dataset as a whole, and return something like:

  wool tension   breaks
1   NA      NA 24.14815
2   NA       L 36.38889
...
5    A      NA 31.03704
...
7    A       L 44.55556
...
12   B       H 18.77778

I tried a few variations of formulas in the aggregate function and couldn't find anything suitable, is this something that can be done simply?

CodePudding user response：

If you have an arbitrary number of inputs that you want to treat as a whole individually, you can set up a loop via Map:

## set formula inputs
rhs <- c("wool","tension")
lhs <- "breaks"

## map it
Map(
  \(d,l,r) aggregate(d[l], d[r], FUN=mean),
  list(warpbreaks),
  list(lhs),
  c(list(rhs), rhs)
)

#[[1]]
#  wool tension   breaks
#1    A       L 44.55556
#2    B       L 28.22222
#3    A       M 24.00000
#4    B       M 28.77778
#5    A       H 24.55556
#6    B       H 18.77778
#
#[[2]]
#  wool   breaks
#1    A 31.03704
#2    B 25.25926
#
#[[3]]
#  tension   breaks
#1       L 36.38889
#2       M 26.38889
#3       H 21.66667

You can extend the rhs inputs however you see fit for many combinations, e.g.:

rhs <- 1:3
c(list(rhs), combn(rhs, 2, simplify=FALSE), rhs)
#[[1]]
#[1] 1 2 3
#
#[[2]]
#[1] 1 2
#
#[[3]]
#[1] 1 3
#
#[[4]]
#[1] 2 3
#
#[[5]]
#[1] 1
#
#[[6]]
#[1] 2
#
#[[7]]
#[1] 3

CodePudding user response：

You can calculate aggregate separately and combine them with bind_rows.

dplyr::bind_rows(aggregate(breaks ~ wool   tension, data = warpbreaks, mean),
                 aggregate(breaks ~ wool, data = warpbreaks, mean),
                 aggregate(breaks ~ tension, data = warpbreaks, mean))

#   wool  tension breaks
#   <fct> <fct>    <dbl>
# 1 A     L       44.556
# 2 A     M       24    
# 3 A     H       24.556
# 4 B     L       28.222
# 5 B     M       28.778
# 6 B     H       18.778
# 7 A     NA      31.037
# 8 B     NA      25.259
# 9 NA    L       36.389
#10 NA    M       26.389
#11 NA    H       21.667

Or in dplyr -

library(dplyr)

bind_rows(
  warpbreaks %>% group_by(wool, tension) %>% summarise(breaks = mean(breaks), .groups = 'drop'), 
  warpbreaks %>% group_by(wool) %>% summarise(breaks = mean(breaks)),
  warpbreaks %>% group_by(tension) %>% summarise(breaks = mean(breaks))
  )

For any number of inputs.

library(dplyr)
library(purrr)

cols <- c('wool', 'tension')

map_df(seq_along(cols), function(x) combn(cols, x, function(y) {
  warpbreaks %>% group_by(across(all_of(y))) %>% summarise(breaks = mean(breaks))
}, simplify = FALSE) %>% bind_rows())