Home > Enterprise >  Restructure data - get mean based on multiple grouping columns
Restructure data - get mean based on multiple grouping columns

Time:05-19

library(tidyverse)

set.seed(10)
dat <- data.frame(age = sample(14:79, size = 15, replace = TRUE),
                  sex = sample(c("m", "f"), size = 15, replace = TRUE),
           region = sample(c("A", "B", "C"), size = 15, replace = TRUE),
           var1 = runif(15, min = 0, max=100)) %>%
  mutate(agegrp = cut(age, breaks = c(-Inf, 20, 50, 70, Inf), labels = c("<= 20", "21-50", "51-70", ">70"))) 

The table looks like this:

   age sex region      var1 agegrp
1   79   m      A 64.669975    >70
2   78   m      C 92.789062    >70
3   23   m      A 69.626845  21-50
4   25   m      C  5.074013  21-50
5   60   f      C 10.340510  51-70
6   36   f      B 90.294240  21-50
7   23   m      A 12.769088  21-50
8   27   f      A 43.892321  21-50
9   35   f      B 99.793467  21-50
10  40   f      C 94.284903  21-50
11  25   m      A 98.829001  21-50
12  55   m      A 98.007185  51-70
13  43   f      A 37.491168  21-50
14  68   m      A 90.051414  51-70
15  76   f      B 13.567239    >70

Unfortunately, our customer needs the data to be in a pretty weird format like this:

  split  value  var1_mean
1 agegrp 21-50       61.3
2 agegrp 51-70       66.1
3 agegrp >70         57.0
4 sex    m           55.7
5 sex    f           66.5
6 region A           64.4
7 region B           67.9
8 region C           50.6
            

I can easily do this using the following code, but it is very unelegant:

age <- dat %>% group_by(agegrp) %>%
  summarise(var1_mean = mean(var1)) %>%
  mutate(value = agegrp,
         split = "agegrp") %>%
  select(split, value, var1_mean)

sex <- dat %>% group_by(sex) %>%
  summarise(var1_mean = mean(var1)) %>%
  mutate(value = "sex",
         split = "sex") %>%
  select(split, value, var1_mean)

region <- dat %>% group_by(region) %>%
  summarise(var1_mean = mean(var1)) %>%
  mutate(value = "region",
         split = "region") %>%
  select(split, value, var1_mean)

rbind(age, sex, region)

Is there a way to make this easier without "stacking" several tables manually (maybe using dplyr)?

CodePudding user response:

You can pivot to long and summarise or, alternatively, iterate over the vars of interest:

library(dplyr)
library(purrr)
library(tidyr)

dat %>%
  pivot_longer(-c(var1, age), names_to = "split") %>%
  group_by(split, value) %>%
  summarise(var1_mean = mean(var1))

Or:

map_df(set_names(c("agegrp", "sex", "region")), ~ dat %>% 
          group_by(across(.x)) %>% 
          summarise(var1_mean = mean(var1)) %>%
          rename(value = .x), .id = "split")

# A tibble: 8 × 3
  split  value var1_mean
  <chr>  <chr>     <dbl>
1 agegrp <= 20      54.6
2 agegrp 21-50      44.7
3 agegrp 51-70      46.4
4 sex    f          37.4
5 sex    m          55.5
6 region A          67.3
7 region B          47.9
8 region C          26.1

CodePudding user response:

Loop through columns and aggregate, then rowbind the results:

do.call(rbind, 
        lapply(c("agegrp", "sex", "region"), function(i){
          cbind(split = i,
                setNames(aggregate(as.formula(paste("age ~", i)), mean, data = dat), 
                         c("value", "var1_mean")))
        }))

#    split value var1_mean
# 1 agegrp <= 20  20.00000
# 2 agegrp 21-50  33.30000
# 3 agegrp 51-70  60.50000
# 4    sex     f  39.62500
# 5    sex     m  39.71429
# 6 region     A  35.75000
# 7 region     B  39.00000
# 8 region     C  43.60000

Note: output is different, probably because the seeded data and shown data do not match.

  • Related