I need to sum the values for about 40 variables by the same group.
This is an example dataset. So I wanted to sum the values of score1-score5 by region and department.
region <- rep(c("south", "east", "west", "north"),times=10)
department <- rep(c("A", "B","C","D","E"),times=8)
score1 <- rnorm(n = 40, mean = 0, sd = 1)
score2 <-rnorm(n = 40, mean = 3, sd = 1.5)
score3 <-rnorm(n = 40, mean = 2, sd = 1)
score4 <-rnorm(n = 40, mean = 1, sd = 1.5)
score5 <-rnorm(n = 40, mean = 5, sd = 1.5)
df <- data.frame(region, department, score1, score2, score3, score4, score5)
This is the code that would lead to the resutls I wanted to have but is there any easier ways to do this:
df %>% group_by(region, department) %>%
summarise(score1=sum(score1),
score2=sum(score2),
score3=sum(score3),
score4=sum(score4),
score5=sum(score5))
I tried to use a loop but this didn't work:
vlist<-c("score1", "score2", "score3", "score4", "score5")
for (var in vlist) {
df<-df %>% group_by(region, department) %>%
summarise(var=sum(.[[var]]))
}
Is there any other ways or what is wrong with my loop? Thanks!
CodePudding user response:
Use across
- loop across
the columns that starts_with
'score' and get the sum
library(dplyr)
out1 <- df %>%
group_by(region, department) %>%
summarise(across(starts_with('score'), sum), .groups = 'drop')
In the for
loop, the issue is that df
is getting updated (df <-..
) in each iteration and summarise
returns only the columns provided in the group by and the summarised output. Thus, after the first iteration, 'df' wouldn't have the 'score' columns at all. If we want to use a for
loop, get the output in a list
and then reduce
with a join
library(purrr)
out_list <- vector('list', length(vlist))
names(out_list) <- vlist
for (var in vlist) {
out_list[[var]] <- df %>%
group_by(region, department) %>%
summarise(!!var := sum(cur_data()[[var]]), .groups = 'drop')
}
out2 <- reduce(out_list, full_join, by = c('region', 'department'))
-checking the outputs
> identical(out1, out2)
[1] TRUE