Home > Enterprise >  Create mean dataframe from a list of dataframes
Create mean dataframe from a list of dataframes

Time:11-10

I've got a list of more than 6000 dataframes with about 6000 observations of 2000 variables. I would like to calculate a mean dataframe of 6000x2000 in which each value is the mean of that same position across all dataframes (as example: mean.df[1,1] = mean(df1[1,1],df2[1,1],df3[1,1],.....dfN[1,1])

I've prepared a short example to illustrate what I need

# Build list of dataframes
df.X <- data.frame("A1" = seq(1,5,1), "A2" = seq(2,10,2) , "A3" = seq(10,18,2))
df.Y <- data.frame("B1" = seq(0,8,2), "B2" = seq(4,8,1) , "B3" = seq(10,18,2))
df.Z <- data.frame("C1" = seq(2,10,2), "C2" = 1:5 , "C3" = seq(10,18,2))
df.list <- list(df.X,df.Y,df.Z)

#Desired output
m1  M2  M3 
1.0 2.3 10.0
2.7 3.7 12.0
4.3 5.0 14.0
6.0 6.3 16.0
7.7 7.7 18.0

I have made some attempts with purrr with reduce and map but had no success. Mabye apply?

Thanks in advance

CodePudding user response:

Base R
Sum all elements of the list together, and divide by number of elements. Does not work with NA-elements!

Reduce(` `, df.list) / length(df.list)

        A1       A2 A3
1 1.000000 2.333333 10
2 2.666667 3.666667 12
3 4.333333 5.000000 14
4 6.000000 6.333333 16
5 7.666667 7.666667 18

All there is left to do is setting the colnames.

CodePudding user response:

A tidyverse solution. We can combine all data frames, create an ID column by each Group, and calculate the mean. This solution is not as succinct as Wimpel's solution, but it can handle missing values by setting na.rm = TRUE in the summarize and across call.

library(tidyverse)

df_mean <- df.list %>%
  map_dfr(~setNames(.x, paste0("A", 1:ncol(.x))), .id = "Group") %>%
  group_by(Group) %>%
  mutate(ID = 1:n()) %>%
  group_by(ID) %>%
  summarize(across(-Group, .fns = mean, na.rm = TRUE)) %>%
  select(-ID)
# # A tibble: 5 x 3
#    A1    A2    A3
#    <dbl> <dbl> <dbl>
# 1  1     2.33    10
# 2  2.67  3.67    12
# 3  4.33  5       14
# 4  6     6.33    16
# 5  7.67  7.67    18

CodePudding user response:

An option is also to convert to array and use colMeans

apply(array(unlist(df.list), c(dim(df.list[[1]]), length(df.list))), 2, rowMeans)

-output

          [,1]     [,2] [,3]
[1,] 1.000000 2.333333   10
[2,] 2.666667 3.666667   12
[3,] 4.333333 5.000000   14
[4,] 6.000000 6.333333   16
[5,] 7.666667 7.666667   18
  • Related