Home > database >  Rowwise sum of subset of columns, column names are stored in another column
Rowwise sum of subset of columns, column names are stored in another column

Time:01-13

Maybe the title doesn't make it too clear, but I didn't know how to formulate it better. I develop the question:

I am working in r with a data frame like the following but much larger:

df <- data.frame(ref = c("01","02","03","04","05"), 
                         var1 = c(2,3,6,8,5),
                         var2 = c(3,8,1,5,7),
                         var3 = c(1,1,4,5,6),
                         columns_to_sum = c("var1, var2", 
                                            "var1, var3", 
                                            "var2, var3",
                                            "var1, var2, var3",
                                            "var1"))

What I want to get is a new column that is the result of the sum of the values of the columns whose name is in the column: "columns_to_sum". Thus, the resulting data frame would be similar to :

enter image description here

Just in case I add the result to the reproducible example in the lines of code below:

df <- data.frame(ref = c("01","02","03","04","05", 
                 var1 = c(2,3,6,8,5),
                 var2 = c(3,8,1,5,7),
                 var3 = c(1,1,4,5,6),
                 columns_to_sum = c("var1, var2", 
                    "var1, var3", 
                    "var2, var3",
                    "var1, var2, var3",
                    "var1"),
                 result_column = c(5,4,5,18,5)
                 )

I have used the dplyr::select function inside the rowSums function a couple of times, for example:

df_rs <- df %>% mutate(
  result_column = rowSums(dplyr::select(., matches("var")))
  )

However, I cannot find a way to select for the sum the columns that are included in the column: "columns_to_sum".

Any idea?

Thank you very much in advance!

CodePudding user response:

Loop through rows, get column names - strsplit, and sum:

cbind(df, 
      result = sapply(seq(nrow(df)), function(i){
        cols <- unlist(strsplit(df$columns_to_sum[ i ], ", ", fixed = TRUE))
        sum(df[i, cols, drop = FALSE])
      }))

#   ref var1 var2 var3   columns_to_sum result
# 1  01    2    3    1       var1, var2      5
# 2  02    3    8    1       var1, var3      4
# 3  03    6    1    4       var2, var3      5
# 4  04    8    5    5 var1, var2, var3     18
# 5  05    5    7    6             var1      5

CodePudding user response:

Here's another idea.

library(purrr)
library(dplyr)
df |>
    mutate(df, result_column = {
        idx <- strsplit(columns_to_sum, ", ")
        vars <- reduce(idx, union)
        (do.call(rbind,
                map(idx, ~vars %in% .x)) *
            df[vars]) |>
            rowSums()
    })

#>   ref var1 var2 var3   columns_to_sum result_column
#> 1  01    2    3    1       var1, var2             5
#> 2  02    3    8    1       var1, var3             4
#> 3  03    6    1    4       var2, var3             5
#> 4  04    8    5    5 var1, var2, var3            18
#> 5  05    5    7    6             var1             5
  • Related