From the tapply
function I am obtaining a list of data frames, each of them with columns A=c(x_1, x_2, ..., x_n)
and B=c(y_1, y_2, ..., y_n)
where x_i, y_i
are always numeric.
What is important is to know that A
is always the same for each data frame, while B
varies.
I want to get the average y_i
for each x_i
in all these dataframes. The obvious way to do it is to join them:
joined <- bind_rows(list_of_dfs)
aggregate(joined$B, list(joined$A), mean)
But joining is quite costly and makes the program rather slow. The size of the data I handle makes efficiency important. How could I perform the aggregation with the mean function without joining the data frames?
Reproducible example:
A = 1:20
X = rnorm(20)
Y = rnorm(20)
Z = rnorm(20)
# Note that only B differs across dfs
df1 <- data.frame(A=A, B=X)
df2 <- data.frame(A=A, B=Y)
df3 <- data.frame(A=A, B=Z)
list_of_dfs <- list(df1, df2, df3)
# Obvious solution involves joining:
joined <- bind_rows(list_of_dfs)
# Compute mean grouped by A
aggregate(joined$B, list(joined$A), mean)
CodePudding user response:
If the values in the A
column are unique in the single dataset and this column is replicated in each of the datasets in the list
, an easier option is to extract ([[
) only the column 'B' from the list
, get the elementwise sum (
) with Reduce
and divide by the length
of the list
to get the elementwise mean
out1 <- data.frame(A = list_of_dfs[[1]]$A, B = Reduce(" ",
lapply(list_of_dfs, "[[", 'B'))/length(list_of_dfs))
In case if there are NA
elements, the above method may return NA
for that element, so, we may use rowMeans
instead which also can remove the NA
s with na.rm
argument
out2 <- data.frame(A = list_of_dfs[[1]]$A,
B = rowMeans(sapply(list_of_dfs, "[[", "B"), na.rm = TRUE))
-checking
> all.equal(out1, out2)
[1] TRUE