Home > OS >  Loop a merge sum function on a set of dataframes in R
Loop a merge sum function on a set of dataframes in R

Time:05-24

I have the following list of dataset :

dflist <- list(df1_A, df1_B, df1_C, df1_D, df1_E,
               df2_A, df2_B, df2_C, df2_D, df2_E,
               df3_A, df3_B, df3_C, df3_D, df3_E,
               df4_A, df4_B, df4_C, df4_D, df4_E)

names(dflist) <- c("df1_A", "df1_B", "df1_C", "df1_D", "df1_E",
                   "df2_A", "df2_B", "df2_C", "df2_D", "df2_E",
                   "df3_A", "df3_B", "df3_C", "df3_D", "df3_E",
                   "df4_A", "df4_B", "df4_C", "df4_D", "df4_E")

Each dataframe have the same structure (with the same column's names):

df1_A
    V1  V2
G18941  17
G20092 534
G19692  10
G19703 260
G16777 231
G20045   0
...

I would like to make a function that merges all the dataframes with the same number (but different letter) in my list and sums the values in column V2 when the names in V1 are the same.

In hard, I managed to do this for df1_A and df1_B with the following code:

newdf <- bind_rows(df1_A, df1_B) %>%
  group_by(V1) %>%
  summarise_all(., sum, na.rm = TRUE)

I can easily turn this into a function like this:

MergeAndSum <- function(df1,df2) 
newdf <- bind_rows(df1, df2) %>%
  group_by(V1) %>%
  summarise_all(., sum, na.rm = TRUE)
return(newdf)

But I don't really see how to call it to do the loop. I try something like :

for (i in 2:length(dflist)){
  df1 <- List_RawCounts_Files[i-1]
  df2 <- List_RawCounts_Files[i]
  out1 <- MergeAndSum(df1,df2)
return(out1)
}

I imagine something that merges sum the df1_A to the df1_B and reassigns the result to df1_A, then calls back the function with df1_A and df1_C and reassigns the result to df1_A, then calls back the function with df1_A and df1_D, and reassigns the result to df1_A, then calls back the function with df1_A and df1_E

Then the same thing with df2 (df2_A, df2_B,... df2_E), then df3, df4 and df5.

If you know how to do this I am listening. Thanks

CodePudding user response:

bind_rows can combine list of dataframes together. You can combine them with the id column so that the name of the list is added as a new column, extract the dataframe name (df1 from df1_A, df2 from df2_A and so on) and take the sum of V2 column for each dataframe and V1 column as group.

library(dplyr)

bind_rows(dflist, .id = "id") %>%
  mutate(id = stringr::str_extract(id, 'df\\d ')) %>%
  group_by(id, V1) %>%
  summarise(V2 = sum(V2, na.rm = TRUE), .groups = "drop")

Since you want to sum only one column (V2) you can use summarise instead of summarise_all which has been superseded.

  • Related