I use both dplyr
and data.table
so I am interested if there is a solution using either package.
I commonly want to summarize data with summary statistics -- let's say mean for the sake of example -- both by some grouping variable and across the entire dataset (grand mean). Then I typically combine them into a single output dataframe for display, with the grand mean row designated "total" or "overall" in the grouping variable column.
Here is how I usually do that, using both dplyr
and data.table
:
dplyr
library(dplyr)
d <- tibble(grp = rep(letters[1:3], 10), v = 1:30)
group_means <- d %>%
group_by(grp) %>%
summarize(v_mean = mean(v))
grand_means <- d %>%
summarize(v_mean = mean(v)) %>%
mutate(grp = 'overall')
bind_rows(group_means, grand_means)
data.table
library(data.table)
d <- data.table(grp = rep(letters[1:3], 10), v = 1:30)
group_means <- d[, .(v_mean = mean(v)), by = .(grp)]
grand_means <- d[, .(v_mean = mean(v))]
grand_means[, grp := 'overall']
rbindlist(list(group_means, grand_means), use.names = TRUE)
My issue is that this isn't very concise. It's not that bad in this example, but if I have to calculate a large number of summary statistics, I have to repeat the same code twice. My question is, is there an idiomatic and concise way to get both grouped and overall summary statistics in either dplyr or data.table?
CodePudding user response:
Here's a concise way with dplyr:
d %>%
add_row(grp = 'overall', v = mean(.$v)) %>%
group_by(grp) %>%
summarise(mean_v = mean(v))
Another option, to avoid repeating summary stats calculation twice:
d %>%
bind_rows(mutate(., grp = 'overall')) %>%
group_by(grp) %>%
summarise(mean_v = mean(v))
CodePudding user response:
I removed my previous answer as I found this neat data.table function
data.table::cube(d, mean(v), by = c("grp"))
This gives you (sub)totals of your groups
grp V1
1: a 14.5
2: b 15.5
3: c 16.5
4: <NA> 15.5
Including replacing NA
and the correct column name
data.table::cube(d, .(v_mean = mean(v)), by = c("grp"))[is.na(grp), grp := "overall"][]
grp v_mean
1: a 14.5
2: b 15.5
3: c 16.5
4: overall 15.5
More info can be found here: https://www.rdocumentation.org/packages/data.table/versions/1.14.2/topics/groupingsets