For loop for dataframes inside a function

I'm trying to write a function in R that takes a list of dataframes, performs operations on various columns, and returns a dataframe of the results, with each column named after the dataframe it was computed from. A simplified example is as follows:

df1 <- data.frame(
 a = c(1, 2, 3),
 b = c(2, 3, 4),
 c = c(5, 6, 7))

df2 <- data.frame(
 a = c(9, 8, 7),
 b = c(5, 1, 1),
 c = c(6, 6, 7))

myfunct <- function(listofdfs){
  results_df <- data.frame(rownames = 'meanA', 'maxA', 'maxB', 'sum')

  for (i in 1:length(listofdfs)) {
    mymean <- mean(listofdfs[i]$a)
    mymaxA <- max(listofdfs[i]$a)
    mymaxB <- max(listofdfs[i]$b)
    mysum <- mymean + mymaxA

    newcol <- c(mymean, mymaxA, mymaxB, mysum)

    results_df[, ncol(results_df) + 1] <- newcol
    colnames(results_df)[ncol(results_df)] <- listofdfs[i]
  }

  results_df
}

Where calling

myfunct(list(df1, df2))

would give this output:

	df1	df2
meanA	2	8
maxA	3	9
maxB	4	5
sum	5	17

I get errors every time I try to make it work, and specifically right now I'm getting an error saying that replacement has 4 rows and data has 1.

Is there a better way to build this type of function than with a for loop? The real function I'm building is more complex than taking the mean, max, and sum of a few numbers, but this dummy example should get the point across.

CodePudding user response：

Here is a base R approach.

list_of_df <- list(df1 = df1, df2 = df2)
f <- function(list_of_df) {
  f1 <- function(df) {
    meanA <- mean(df$a)
    maxA <- max(df$a)
    maxB <- max(df$b)
    c(meanA = meanA, maxA = maxA, maxB = maxB, sum = meanA + maxA)
  }
  as.data.frame(lapply(list_of_df, f1))
}
f(list_of_df)
#       df1 df2
# meanA   2   8
# maxA    3   9
# maxB    4   5
# sum     5  17

A few remarks:

If you want the variables in the result to be named df1 and df2, then you need the elements of your list of data frames to be named accordingly. list(value1, value2) is a list without names. list(name1 = value1, name2 = value2) is a list with names c("name1", "name2").
This implementation presupposes that all of your summary statistics are numeric. It concatenates all of the summary statistics for a given data frame using c, and the elements of the resulting atomic vector are constrained to have a common type.
Internally, lapply is still looping over your list of data frames. Using lapply is much cleaner than implementing the loop yourself, but lapply may not perform significantly better than an equivalent loop. If you are actually worried about performance, then you may need to describe the structure of your actual data in more detail.
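The first two remarks can be checked directly at the console. The following is a small sketch, not part of the original answer; the object names unnamed, named, and mixed are made up for illustration.

```r
df1 <- data.frame(a = c(1, 2, 3), b = c(2, 3, 4), c = c(5, 6, 7))
df2 <- data.frame(a = c(9, 8, 7), b = c(5, 1, 1), c = c(6, 6, 7))

# An unnamed list carries no names, so there is nothing to use as column names.
unnamed <- list(df1, df2)
names(unnamed)
# NULL

# Naming the elements when the list is created fixes that.
named <- list(df1 = df1, df2 = df2)
names(named)
# [1] "df1" "df2"

# Names can also be assigned after the fact.
names(unnamed) <- c("df1", "df2")
identical(unnamed, named)
# [1] TRUE

# c() coerces its arguments to a common type: one character value
# turns the whole vector into character.
mixed <- c(meanA = mean(df1$a), label = "x")
class(mixed)
# [1] "character"
```

If your real summary statistics mix types, returning a list or a one-row data frame from the inner function instead of an atomic vector is probably the safer choice.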

FWIW, the reason you are seeing the replacement error is this:

data.frame(rownames = 'meanA', 'maxA', 'maxB', 'sum')
#   rownames X.maxA. X.maxB. X.sum.
# 1    meanA    maxA    maxB    sum

You initialized a data frame with one row, and you were trying to append a length-4 vector. Perhaps what you intended was:

data.frame(row.names = c("meanA", "maxA", "maxB", "sum"))
# data frame with 0 columns and 4 rows
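If you would rather keep the explicit loop, two more changes seem necessary besides the initialization: listofdfs[i] returns a one-element list (so listofdfs[i]$a is NULL), whereas listofdfs[[i]] returns the data frame itself, and the column name has to come from names(listofdfs) rather than from the list element. A minimal sketch of the repaired loop, assuming the list passed in is named:

```r
df1 <- data.frame(a = c(1, 2, 3), b = c(2, 3, 4), c = c(5, 6, 7))
df2 <- data.frame(a = c(9, 8, 7), b = c(5, 1, 1), c = c(6, 6, 7))

myfunct <- function(listofdfs) {
  # Start with four named rows and no columns.
  results_df <- data.frame(row.names = c("meanA", "maxA", "maxB", "sum"))

  for (i in seq_along(listofdfs)) {
    # [[i]] extracts the data frame; [i] would return a one-element list.
    df <- listofdfs[[i]]

    mymean <- mean(df$a)
    mymaxA <- max(df$a)
    mymaxB <- max(df$b)
    mysum  <- mymean + mymaxA

    # Add the statistics as a new column named after the list element
    # (this assumes the list has names).
    results_df[[names(listofdfs)[i]]] <- c(mymean, mymaxA, mymaxB, mysum)
  }

  results_df
}

myfunct(list(df1 = df1, df2 = df2))
#       df1 df2
# meanA   2   8
# maxA    3   9
# maxB    4   5
# sum     5  17
```

seq_along(listofdfs) is used instead of 1:length(listofdfs) so that an empty list results in zero iterations rather than iterating over c(1, 0).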

CodePudding user response：

You could use a purrr approach instead:

list_of_df <- list(df1 = df1, df2 = df2)

library(dplyr)
library(tidyr)
library(purrr)

list_of_df %>% 
  map_df(~.x %>% 
           summarise(mymean = mean(a),
                     mymaxA = max(a),
                     mymaxB = max(b),
                     mysum  = mymean + mymaxA),
         .id = "name") %>% 
  pivot_longer(-name,
               names_to = "var",
               values_to = "value") %>% 
  pivot_wider()

This returns

# A tibble: 4 x 3
  var      df1   df2
  <chr>  <dbl> <dbl>
1 mymean     2     8
2 mymaxA     3     9
3 mymaxB     4     5
4 mysum      5    17

You could wrap it into a function like

myfunct <- function(listofdfs){
  listofdfs %>% 
    map_df(~.x %>% 
             summarise(mymean = mean(a),
                       mymaxA = max(a),
                       mymaxB = max(b),
                       mysum  = mymean + mymaxA),
           .id = "name") %>% 
    pivot_longer(-name,
                 names_to = "var",
                 values_to = "value") %>% 
    pivot_wider()
}

myfunct(list_of_df)

returning the same output.
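As a side note, purrr 1.0.0 and later mark map_df() as superseded in favour of map() followed by list_rbind(). The variant below is a sketch under that assumption (purrr >= 1.0.0 installed); the name myfunct2 is made up here, and the pivot_wider() arguments are spelled out rather than left at their defaults.

```r
library(dplyr)
library(tidyr)
library(purrr)  # assumes purrr >= 1.0.0, which provides list_rbind()

df1 <- data.frame(a = c(1, 2, 3), b = c(2, 3, 4), c = c(5, 6, 7))
df2 <- data.frame(a = c(9, 8, 7), b = c(5, 1, 1), c = c(6, 6, 7))

# Hypothetical name; same logic as myfunct above.
myfunct2 <- function(listofdfs) {
  listofdfs %>%
    # One one-row summary per data frame.
    map(~ summarise(.x,
                    mymean = mean(a),
                    mymaxA = max(a),
                    mymaxB = max(b),
                    mysum  = mymean + mymaxA)) %>%
    # Stack the summaries; the list names go into the "name" column.
    list_rbind(names_to = "name") %>%
    pivot_longer(-name, names_to = "var", values_to = "value") %>%
    pivot_wider(names_from = name, values_from = value)
}

myfunct2(list(df1 = df1, df2 = df2))
```

This should return the same four-row tibble as shown above, with columns var, df1, and df2.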