Home > Back-end >  loop to return mean value by group
loop to return mean value by group

Time:10-27

I was trying to calculate the mean value by group for several variables. I wrote a loop to do that but it didn't give me what I want. Ideally, in the returned dataset, I'd want to have something like this:

# varA var1_mean var2_mean var3_mean
# 1          xx    xx        xx
# 2          xx    xx        xx
# 3          xx    xx        xx
# 4          xx    xx        xx

Here is my example code.

varA<-rep(c(1:4),times=30)
df1<-data.frame(varA)
df1$var1 <- sample(500:1000, length(df1$varA))
df1$var2 <- sample(500:1000, length(df1$varA))
df1$var3 <- sample(500:1000, length(df1$varA))

varlist<-c("var1", "var2", "var3")

for( varname in varlist) {
  mean_var<-paste0(varname, "_mean")
  df1_mean<- df1 %>%
    group_by(varA) %>%
    dplyr::summarise(mean_var=mean(.[[varname]], na.rm = TRUE))
}

Thanks in advance for the help!

CodePudding user response:

Instead of a loop, use dplyr::across() to apply a summarise() (or mutate()) operation to multiple columns. You can also include multiple functions to be applied to each column.

set.seed(13)
library(dplyr)

df1 %>%
  group_by(varA) %>%
  summarise(across(
    .cols = all_of(varlist),
    .fns = list(
      mean = ~ mean(.x, na.rm = TRUE), 
      q5 = ~ quantile(.x, .05, na.rm = TRUE), 
      q95 = ~ quantile(.x, .95, na.rm = TRUE)
    )
  ))

Output:

# A tibble: 4 × 10
   varA var1_m…¹ var1_q5 var1_…² var2_…³ var2_q5 var2_…⁴ var3_…⁵ var3_q5 var3_…⁶
  <int>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1     1     748.    539.    971.    765.    546.    984.    720.    535.    945.
2     2     765.    570.    961.    745.    516.    955.    799.    594.    974.
3     3     738.    549.    946.    785.    571.    951.    711.    548.    969.
4     4     728.    510.    934.    715.    513.    982.    761.    534.    945.
# … with abbreviated variable names ¹​var1_mean, ²​var1_q95, ³​var2_mean,
#   ⁴​var2_q95, ⁵​var3_mean, ⁶​var3_q95

Note that if you’re using a character vector of column names, as you are, you should wrap it in all_of() or any_of() as I did above. You can also use any other tidyselect syntax in across().

CodePudding user response:

For data.table package, the solution may be as follows:

library(data.table)

varA<-rep(c(1:4),times=30)
df1<-data.table(varA)
df1$var1 <- sample(500:1000, length(df1$varA))
df1$var2 <- sample(500:1000, length(df1$varA))
df1$var3 <- sample(500:1000, length(df1$varA))

df1[, lapply(.SD, mean, na.rm = TRUE), by = 'varA']

Edit: for named variable

df1[
  , c(
    lapply(.SD, mean) %>% `names<-`(paste0('mean_var', 1:3)), 
    lapply(.SD, quantile, .05) %>% `names<-`(paste0('q5_var', 1:3)),
    lapply(.SD, quantile, .95) %>% `names<-`(paste0('q95_var', 1:3))
    ), 
  by = .(varA)
]
  • Related