Home > Mobile >  sum across multiple columns of a data frame based on multiple patterns R
sum across multiple columns of a data frame based on multiple patterns R

Time:04-15

I have a data frame of multiple variables for for different years, that looks kind of like this:

df <- data.frame(name=c("name1", "name2", "name3", "name4"),
                X1990=c(1,6,8,NA),
                X1990.1=c(10,20,NA,2),
                X1990.2=c(2,4,6,8),
                X1990.3=c(1,NA,3,6),
                X1990.4=c(8,7,5,4),
                X1991=c(2,6,3,5),
                X1991.1=c(NA,20,NA,2),
                X1991.2=c(NA,NA,NA,NA),
                X1991.3=c(1,NA,3,5),
                X1991.4=c(8,9,6,3))

I made this example with only 5 variables per year and with only 2 year, but in reality is a much larger df, with tens of variables for the years 1990 to 2020.

I want to create a new dataframe with the sums all the columns for the same year, so that the new data frame looks like this:

df_sum <- data.frame(name=c("name1", "name2", "name3", "name4"),
                     X1990=c(22, 37, 22, 20),
                     X1991=c(11,35,12,15))

I was thinking some loop over rowSums(across(matches('pattern')), na.rm = TRUE) that I found on another questions, but so far have not been successful to implement.

Thanks!

CodePudding user response:

We can reshape to 'long' format with pivot_longer, and get the sum while reshaping back to 'wide'

library(dplyr)
library(tidyr)
library(stringr)
df %>%
   pivot_longer(cols = starts_with("X"), names_to = "name1") %>% 
   mutate(name1 = str_remove(name1, "\\.\\d $")) %>% 
   pivot_wider(names_from = name1, values_from = value, 
      values_fn = ~ sum(.x, na.rm = TRUE))

-output

# A tibble: 4 × 3
  name  X1990 X1991
  <chr> <dbl> <dbl>
1 name1    22    11
2 name2    37    35
3 name3    22    12
4 name4    20    15

Or in base R, use split.default to split the data into a list of datasets based on the column name pattern, get the rowSums and cbind with the first column

cbind(df[1], sapply(split.default(df[-1], 
  trimws(names(df)[-1], whitespace = "\\.\\d ")), rowSums, na.rm = TRUE))
   name X1990 X1991
1 name1    22    11
2 name2    37    35
3 name3    22    12
4 name4    20    15
  • Related