Grouped mean across list of dataframes in R-CodePudding

From the tapply function I am obtaining a list of data frames, each of them with columns A=c(x_1, x_2, ..., x_n) and B=c(y_1, y_2, ..., y_n) where x_i, y_i are always numeric.

What is important is to know that A is always the same for each data frame, while B varies.

I want to get the average y_i for each x_i in all these dataframes. The obvious way to do it is to join them:

joined <- bind_rows(list_of_dfs)
aggregate(joined$B, list(joined$A), mean)

But joining is quite costly and makes the program rather slow. The size of the data I handle makes efficiency important. How could I perform the aggregation with the mean function without joining the data frames?

Reproducible example:

A = 1:20
X = rnorm(20)
Y = rnorm(20)
Z = rnorm(20)

# Note that only B differs across dfs
df1 <- data.frame(A=A, B=X)
df2 <- data.frame(A=A, B=Y)
df3 <- data.frame(A=A, B=Z)

list_of_dfs <- list(df1, df2, df3)

# Obvious solution involves joining:

joined <- bind_rows(list_of_dfs)
# Compute mean grouped by A
aggregate(joined$B, list(joined$A), mean)

CodePudding user response：

If the values in the A column are unique in the single dataset and this column is replicated in each of the datasets in the list, an easier option is to extract ([[) only the column 'B' from the list, get the elementwise sum ( ) with Reduce and divide by the length of the list to get the elementwise mean

out1 <- data.frame(A = list_of_dfs[[1]]$A, B = Reduce(" ",
         lapply(list_of_dfs, "[[", 'B'))/length(list_of_dfs))

In case if there are NA elements, the above method may return NA for that element, so, we may use rowMeans instead which also can remove the NAs with na.rm argument

out2 <- data.frame(A = list_of_dfs[[1]]$A, 
     B = rowMeans(sapply(list_of_dfs, "[[", "B"), na.rm = TRUE))

-checking

> all.equal(out1, out2)
[1] TRUE