Home > Software engineering >  Idiomatic dplyr and/or data.table way to get group means and grand means "idiomatically" i
Idiomatic dplyr and/or data.table way to get group means and grand means "idiomatically" i

Time:04-01

I use both dplyr and data.table so I am interested if there is a solution using either package.

I commonly want to summarize data with summary statistics -- let's say mean for the sake of example -- both by some grouping variable and across the entire dataset (grand mean). Then I typically combine them into a single output dataframe for display, with the grand mean row designated "total" or "overall" in the grouping variable column.

Here is how I usually do that, using both dplyr and data.table:

dplyr

library(dplyr)

d <- tibble(grp = rep(letters[1:3], 10), v = 1:30)

group_means <- d %>%
  group_by(grp) %>%
  summarize(v_mean = mean(v))
grand_means <- d %>%
  summarize(v_mean = mean(v)) %>%
  mutate(grp = 'overall')

bind_rows(group_means, grand_means)

data.table

library(data.table)

d <- data.table(grp = rep(letters[1:3], 10), v = 1:30)

group_means <- d[, .(v_mean = mean(v)), by = .(grp)]
grand_means <- d[, .(v_mean = mean(v))]
grand_means[, grp := 'overall']

rbindlist(list(group_means, grand_means), use.names = TRUE)

My issue is that this isn't very concise. It's not that bad in this example, but if I have to calculate a large number of summary statistics, I have to repeat the same code twice. My question is, is there an idiomatic and concise way to get both grouped and overall summary statistics in either dplyr or data.table?

CodePudding user response:

Here's a concise way with dplyr:

d %>%
  add_row(grp = 'overall', v = mean(.$v)) %>%
  group_by(grp) %>%
  summarise(mean_v = mean(v))

Another option, to avoid repeating summary stats calculation twice:

d %>%
  bind_rows(mutate(., grp = 'overall')) %>%
  group_by(grp) %>%
  summarise(mean_v = mean(v))

CodePudding user response:

I removed my previous answer as I found this neat data.table function

data.table::cube(d, mean(v), by = c("grp"))

This gives you (sub)totals of your groups

    grp   V1
1:    a 14.5
2:    b 15.5
3:    c 16.5
4: <NA> 15.5

Including replacing NA and the correct column name

data.table::cube(d, .(v_mean = mean(v)), by = c("grp"))[is.na(grp), grp := "overall"][]

       grp v_mean
1:       a   14.5
2:       b   15.5
3:       c   16.5
4: overall   15.5

More info can be found here: https://www.rdocumentation.org/packages/data.table/versions/1.14.2/topics/groupingsets

  • Related