I was trying to calculate the mean value by group for several variables. I wrote a loop to do that but it didn't give me what I want. Ideally, in the returned dataset, I'd want to have something like this:
# varA var1_mean var2_mean var3_mean
# 1 xx xx xx
# 2 xx xx xx
# 3 xx xx xx
# 4 xx xx xx
Here is my example code.
varA<-rep(c(1:4),times=30)
df1<-data.frame(varA)
df1$var1 <- sample(500:1000, length(df1$varA))
df1$var2 <- sample(500:1000, length(df1$varA))
df1$var3 <- sample(500:1000, length(df1$varA))
varlist<-c("var1", "var2", "var3")
for( varname in varlist) {
mean_var<-paste0(varname, "_mean")
df1_mean<- df1 %>%
group_by(varA) %>%
dplyr::summarise(mean_var=mean(.[[varname]], na.rm = TRUE))
}
Thanks in advance for the help!
CodePudding user response:
Instead of a loop, use dplyr::across()
to apply a summarise()
(or mutate()
) operation to multiple columns. You can also include multiple functions to be applied to each column.
set.seed(13)
library(dplyr)
df1 %>%
group_by(varA) %>%
summarise(across(
.cols = all_of(varlist),
.fns = list(
mean = ~ mean(.x, na.rm = TRUE),
q5 = ~ quantile(.x, .05, na.rm = TRUE),
q95 = ~ quantile(.x, .95, na.rm = TRUE)
)
))
Output:
# A tibble: 4 × 10
varA var1_m…¹ var1_q5 var1_…² var2_…³ var2_q5 var2_…⁴ var3_…⁵ var3_q5 var3_…⁶
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 748. 539. 971. 765. 546. 984. 720. 535. 945.
2 2 765. 570. 961. 745. 516. 955. 799. 594. 974.
3 3 738. 549. 946. 785. 571. 951. 711. 548. 969.
4 4 728. 510. 934. 715. 513. 982. 761. 534. 945.
# … with abbreviated variable names ¹var1_mean, ²var1_q95, ³var2_mean,
# ⁴var2_q95, ⁵var3_mean, ⁶var3_q95
Note that if you’re using a character vector of column names, as you are, you should wrap it in all_of()
or any_of()
as I did above. You can also use any other tidyselect syntax in across()
.
CodePudding user response:
For data.table
package, the solution may be as follows:
library(data.table)
varA<-rep(c(1:4),times=30)
df1<-data.table(varA)
df1$var1 <- sample(500:1000, length(df1$varA))
df1$var2 <- sample(500:1000, length(df1$varA))
df1$var3 <- sample(500:1000, length(df1$varA))
df1[, lapply(.SD, mean, na.rm = TRUE), by = 'varA']
Edit: for named variable
df1[
, c(
lapply(.SD, mean) %>% `names<-`(paste0('mean_var', 1:3)),
lapply(.SD, quantile, .05) %>% `names<-`(paste0('q5_var', 1:3)),
lapply(.SD, quantile, .95) %>% `names<-`(paste0('q95_var', 1:3))
),
by = .(varA)
]