Sum numeric sub-dataframe within a list


I have an R list of data frames. All data frames have the same format and the same dimensions: the first 2 columns are strings, like IDs and names (identical across all data frames), and the rest are numeric values. I want to sum the numeric parts of all the data frames element-wise, i.e. the output at index (1,3) is the sum of the values at index (1,3) across all the data frames.

E.g. given a list L consisting of data frames x and y, I want to get an output like z:

x <-data.frame(ID=c("aa","bb"),name=c("cc","dd"),v1=c(1,2),v2=c(3,4))
y <-data.frame(ID=c("aa","bb"),name=c("cc","dd"),v1=c(5,6),v2=c(7,8))
L <- list(x,y)
z <- data.frame(ID=c("aa","bb"),name=c("cc","dd"),v1=c(1+5,2+6),v2=c(3+7,4+8))

I know how to do this using a for loop, but I want to learn to do it in a more R-like way, i.e. using vectorized functions like the apply family.
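
For example, the for-loop version I have in mind looks roughly like this (a rough sketch, assuming the first two columns are always ID and name):

z <- L[[1]]
for (i in seq_along(L)[-1]) {
  #add the numeric part of each remaining data frame onto the accumulator
  z[, -c(1, 2)] <- z[, -c(1, 2)] + L[[i]][, -c(1, 2)]
}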

Currently my idea is to create a new data frame with only the ID and name columns, use a global variable to sum the numeric parts, and finally cbind the two parts:

output <- x[,1:2]
num_sum <- matrix(0,nrow=nrow(L[[1]]),ncol=ncol(L[[1]][,-c(1,2)]))
lapply(L, function(a){num_sum <<- a[3:length(a)] + num_sum})
cbind(output,num_sum)

But this approach has some problems I'd prefer to avoid:

  1. I need to manually build the two parts of the output and then manually join them
  2. lapply() returns a list in which each element is an intermediate num_sum from one iteration, which wastes memory
  3. I'm using the global variable num_sum to accumulate the result, but num_sum is not needed afterwards and I have to remove it manually

CodePudding user response:

If the order of the first two columns is always the same, you can do:

#Get all numeric columns
num <- sapply(L[[1]], is.numeric)

#Sum them across elements of the list
df_num <- Reduce(`+`, lapply(L, `[`, num))

#Get the non-numeric columns and bind them with sum of numeric columns
cbind(L[[1]][!num], df_num)

output

  ID name v1 v2
1 aa   cc  6 10
2 bb   dd  8 12
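
To break that down, lapply(L, `[`, num) subsets each data frame to just its numeric columns, and Reduce(`+`, ...) adds those data frames element-wise (nums_only below is only an illustrative name):

#Subset each element of the list to its numeric columns
nums_only <- lapply(L, `[`, num)

#For a two-element list, Reduce(`+`, nums_only) is simply
nums_only[[1]] + nums_only[[2]]
#  v1 v2
#1  6 10
#2  8 12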

If they are different, you can use powerjoin to do an inner join on the selected columns and sum the rest with the conflict argument:

library(powerjoin)
sum_inner_join <- 
  function(x, y) power_inner_join(x, y, by = c("ID", "name"), conflict = ~ .x + .y)
Reduce(sum_inner_join, L)

output

  ID name v1 v2
1 aa   cc  6 10
2 bb   dd  8 12

CodePudding user response:

Using dplyr and purrr (which has somewhat nicer map functions), you could do something like this:

library(purrr)
library(dplyr)
result <- reduce(L, function(x,y){
    xVals <- x |> select(-ID, -name)
    yVals <- y |> select(-ID, -name)

    totalVals <- xVals |> map2(yVals, function(x, y) {
        rowSums(cbind(x, y))
    })

    return(x |> select(ID, name) |> cbind(totalVals))
})

CodePudding user response:

Similar logic to Maël's answer, but squishing it all into a Map call:

data.frame(do.call(Map, 
   c(\(...) if(is.numeric(..1)) Reduce(`+`, list(...)) else ..1, L)
))
#  ID name v1 v2
#1 aa   cc  6 10
#2 bb   dd  8 12

If the first chunk of the column (..1) is numeric, sum the corresponding columns across all elements of the list; otherwise return the first chunk as-is.
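
On R versions older than 4.1 (which introduced the \(...) lambda shorthand), the same call can be written with an ordinary anonymous function:

data.frame(do.call(Map, 
   c(function(...) if(is.numeric(..1)) Reduce(`+`, list(...)) else ..1, L)
))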

You could also do it via an aggregation if all the rows are unique:

tmp <- do.call(rbind, L)
nums <- sapply(tmp, is.numeric)
aggregate(tmp[nums], tmp[!nums], FUN=sum)
#  ID name v1 v2
#1 aa   cc  6 10
#2 bb   dd  8 12
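
Here do.call(rbind, L) just stacks the list row-wise before grouping; with the example data, tmp should look like:

tmp
#  ID name v1 v2
#1 aa   cc  1  3
#2 bb   dd  2  4
#3 aa   cc  5  7
#4 bb   dd  6  8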