I'm trying to write a function in R that takes a list of data frames, performs operations on various columns, and returns a data frame of the results, with each column named after the data frame it was computed from. A simplified example is as follows:
df1 <- data.frame(
  a = c(1, 2, 3),
  b = c(2, 3, 4),
  c = c(5, 6, 7))
df2 <- data.frame(
  a = c(9, 8, 7),
  b = c(5, 1, 1),
  c = c(6, 6, 7))
myfunct <- function(listofdfs){
  results_df <- data.frame(rownames = 'meanA', 'maxA', 'maxB', 'sum')
  for (i in 1:length(listofdfs)) {
    mymean <- mean(listofdfs[i]$a)
    mymaxA <- max(listofdfs[i]$a)
    mymaxB <- max(listofdfs[i]$b)
    mysum <- mymean + mymaxA
    newcol <- c(mymean, mymaxA, mymaxB, mysum)
    results_df[, ncol(results_df) + 1] <- newcol
    colnames(results_df)[ncol(results_df)] <- listofdfs[i]
  }
  results_df
}
Where calling
myfunct(list(df1, df2))
would give this output:
| | df1 | df2 |
|---|---|---|
| meanA | 2 | 8 |
| maxA | 3 | 9 |
| maxB | 4 | 5 |
| sum | 5 | 17 |
I get errors every time I try to make it work, and specifically right now I'm getting an error saying that replacement has 4 rows and data has 1.
Is there a better way to build this type of function than with a for loop? The real function I'm building is more complex than just taking the mean, max, and sum of a few digits, but this dummy should get the point across.
CodePudding user response:
Here is a base R approach.
list_of_df <- list(df1 = df1, df2 = df2)
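f <- function(list_of_df) {
  # Summarise one data frame as a named numeric vector
  f1 <- function(df) {
    meanA <- mean(df$a)
    maxA <- max(df$a)
    maxB <- max(df$b)
    c(meanA = meanA, maxA = maxA, maxB = maxB, sum = meanA + maxA)
  }
  # One column per list element, named after the list names
  as.data.frame(lapply(list_of_df, f1))
}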
f(list_of_df)
# df1 df2
# meanA 2 8
# maxA 3 9
# maxB 4 5
# sum 5 17
A few remarks:
- If you want the variables in the result to be named `df1` and `df2`, then you need the elements of your list of data frames to be named accordingly. `list(value1, value2)` is a list without names. `list(name1 = value1, name2 = value2)` is a list with names `c("name1", "name2")`.
- This implementation presupposes that all of your summary statistics are numeric. It concatenates all of the summary statistics for a given data frame using `c`, and the elements of the resulting atomic vector are constrained to have a common type.
- Internally, `lapply` is still looping over your list of data frames. Using `lapply` is much cleaner than implementing the loop yourself, but `lapply` may not perform significantly better than an equivalent loop. If you are actually worried about performance, then you may need to describe the structure of your actual data in more detail.
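The first two remarks are easy to check in the console. Here is a minimal sketch; the data frames `x` and `y` and the vector `mixed` are throwaway names made up for illustration, not part of the question.

```r
# Two small data frames used only for illustration (hypothetical names).
x <- data.frame(a = 1:3)
y <- data.frame(a = 4:6)

# An unnamed list has NULL names, so lapply() has no names to pass on.
names(list(x, y))
# NULL

# A named list keeps its names, and lapply() carries them into the result,
# where they become the column names of the data frame.
named <- list(x = x, y = y)
res <- as.data.frame(lapply(named, function(df) c(meanA = mean(df$a), maxA = max(df$a))))
colnames(res)
# [1] "x" "y"

# c() builds an atomic vector, so mixing a number with a string
# coerces every element to character.
mixed <- c(meanA = 2, label = "high")
class(mixed)
# [1] "character"
```

If some of your real summary statistics are not numeric, returning a one-row data frame or a list from the inner function instead of `c(...)` would avoid that coercion, at the cost of a different way of combining the pieces.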
FWIW, the reason you are seeing the replacement error is this:
data.frame(rownames = 'meanA', 'maxA', 'maxB', 'sum')
# rownames X.maxA. X.maxB. X.sum.
# 1 meanA maxA maxB sum
You initialized a data frame with one row, and you were trying to append a length-4 vector. Perhaps what you intended was:
data.frame(row.names = c("meanA", "maxA", "maxB", "sum"))
# data frame with 0 columns and 4 rows
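If you would rather keep the explicit loop, here is one way it could be repaired. This is a sketch rather than the only fix, and it assumes the list is named. Besides the initialization, it uses `[[ ]]` instead of `[ ]`: `listofdfs[i]` is a one-element list, not a data frame, so `listofdfs[i]$a` is `NULL`.

```r
df1 <- data.frame(a = c(1, 2, 3), b = c(2, 3, 4), c = c(5, 6, 7))
df2 <- data.frame(a = c(9, 8, 7), b = c(5, 1, 1), c = c(6, 6, 7))

myfunct <- function(listofdfs) {
  # Start with four named rows and no columns, so each new column has length 4.
  results_df <- data.frame(row.names = c("meanA", "maxA", "maxB", "sum"))
  for (nm in names(listofdfs)) {
    df <- listofdfs[[nm]]  # [[ ]] extracts the data frame itself
    mymean <- mean(df$a)
    mymaxA <- max(df$a)
    mymaxB <- max(df$b)
    # Add one column, named after the list element.
    results_df[[nm]] <- c(mymean, mymaxA, mymaxB, mymean + mymaxA)
  }
  results_df
}

myfunct(list(df1 = df1, df2 = df2))
```

Looping over `names(listofdfs)` rather than `1:length(listofdfs)` gives you the column name for free, but it does mean an unnamed list produces an empty result.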
CodePudding user response:
You could use a purrr approach instead:
list_of_df <- list(df1 = df1, df2 = df2)
library(dplyr)
library(tidyr)
library(purrr)
list_of_df %>%
  map_df(~.x %>%
           summarise(mymean = mean(a),
                     mymaxA = max(a),
                     mymaxB = max(b),
                     mysum = mymean + mymaxA),
         .id = "name") %>%
  pivot_longer(-name,
               names_to = "var",
               values_to = "value") %>%
  pivot_wider()
This returns
# A tibble: 4 x 3
var df1 df2
<chr> <dbl> <dbl>
1 mymean 2 8
2 mymaxA 3 9
3 mymaxB 4 5
4 mysum 5 17
You could wrap it into a function like
myfunct <- function(listofdfs){
  listofdfs %>%
    map_df(~.x %>%
             summarise(mymean = mean(a),
                       mymaxA = max(a),
                       mymaxB = max(b),
                       mysum = mymean + mymaxA),
           .id = "name") %>%
    pivot_longer(-name,
                 names_to = "var",
                 values_to = "value") %>%
    pivot_wider()
}
myfunct(list_of_df)
returning the same output.