I have an R data frame with factor and numeric columns, and I would like to calculate certain summaries of the numeric columns by various groupings of the factors. In particular, I would like to be able to summarise by multiple combinations of the factors at once, and return the results in a single object.
For example, say I'm using the warpbreaks
data frame, which has columns breaks (integer), wool (factor with levels "A", "B") and tension (factor with levels "L", "M" and "H"). If I want to get the average number of breaks for each combination of wool and tension, I know I can use aggregate(breaks ~ wool tension, data = warpbreaks, mean)
and it will give me something like:
wool tension breaks
1 A L 44.55556
2 B L 28.22222
...
6 B H 18.77778
But I'd like to also calculate the means across just wool, and just tension, and for the dataset as a whole, and return something like:
wool tension breaks
1 NA NA 24.14815
2 NA L 36.38889
...
5 A NA 31.03704
...
7 A L 44.55556
...
12 B H 18.77778
I tried a few variations of formulas in the aggregate function and couldn't find anything suitable, is this something that can be done simply?
CodePudding user response:
If you have an arbitrary number of inputs that you want to treat as a whole individually, you can set up a loop via Map
:
## set formula inputs
rhs <- c("wool","tension")
lhs <- "breaks"
## map it
Map(
\(d,l,r) aggregate(d[l], d[r], FUN=mean),
list(warpbreaks),
list(lhs),
c(list(rhs), rhs)
)
#[[1]]
# wool tension breaks
#1 A L 44.55556
#2 B L 28.22222
#3 A M 24.00000
#4 B M 28.77778
#5 A H 24.55556
#6 B H 18.77778
#
#[[2]]
# wool breaks
#1 A 31.03704
#2 B 25.25926
#
#[[3]]
# tension breaks
#1 L 36.38889
#2 M 26.38889
#3 H 21.66667
You can extend the rhs
inputs however you see fit for many combinations, e.g.:
rhs <- 1:3
c(list(rhs), combn(rhs, 2, simplify=FALSE), rhs)
#[[1]]
#[1] 1 2 3
#
#[[2]]
#[1] 1 2
#
#[[3]]
#[1] 1 3
#
#[[4]]
#[1] 2 3
#
#[[5]]
#[1] 1
#
#[[6]]
#[1] 2
#
#[[7]]
#[1] 3
CodePudding user response:
You can calculate aggregate
separately and combine them with bind_rows
.
dplyr::bind_rows(aggregate(breaks ~ wool tension, data = warpbreaks, mean),
aggregate(breaks ~ wool, data = warpbreaks, mean),
aggregate(breaks ~ tension, data = warpbreaks, mean))
# wool tension breaks
# <fct> <fct> <dbl>
# 1 A L 44.556
# 2 A M 24
# 3 A H 24.556
# 4 B L 28.222
# 5 B M 28.778
# 6 B H 18.778
# 7 A NA 31.037
# 8 B NA 25.259
# 9 NA L 36.389
#10 NA M 26.389
#11 NA H 21.667
Or in dplyr
-
library(dplyr)
bind_rows(
warpbreaks %>% group_by(wool, tension) %>% summarise(breaks = mean(breaks), .groups = 'drop'),
warpbreaks %>% group_by(wool) %>% summarise(breaks = mean(breaks)),
warpbreaks %>% group_by(tension) %>% summarise(breaks = mean(breaks))
)
For any number of inputs.
library(dplyr)
library(purrr)
cols <- c('wool', 'tension')
map_df(seq_along(cols), function(x) combn(cols, x, function(y) {
warpbreaks %>% group_by(across(all_of(y))) %>% summarise(breaks = mean(breaks))
}, simplify = FALSE) %>% bind_rows())