I have a data frame of multiple variables for for different years, that looks kind of like this:
df <- data.frame(name=c("name1", "name2", "name3", "name4"),
X1990=c(1,6,8,NA),
X1990.1=c(10,20,NA,2),
X1990.2=c(2,4,6,8),
X1990.3=c(1,NA,3,6),
X1990.4=c(8,7,5,4),
X1991=c(2,6,3,5),
X1991.1=c(NA,20,NA,2),
X1991.2=c(NA,NA,NA,NA),
X1991.3=c(1,NA,3,5),
X1991.4=c(8,9,6,3))
I made this example with only 5 variables per year and with only 2 year, but in reality is a much larger df, with tens of variables for the years 1990 to 2020.
I want to create a new dataframe with the sums all the columns for the same year, so that the new data frame looks like this:
df_sum <- data.frame(name=c("name1", "name2", "name3", "name4"),
X1990=c(22, 37, 22, 20),
X1991=c(11,35,12,15))
I was thinking some loop over rowSums(across(matches('pattern')), na.rm = TRUE)
that I found on another questions, but so far have not been successful to implement.
Thanks!
CodePudding user response:
We can reshape to 'long' format with pivot_longer
, and get the sum
while reshaping back to 'wide'
library(dplyr)
library(tidyr)
library(stringr)
df %>%
pivot_longer(cols = starts_with("X"), names_to = "name1") %>%
mutate(name1 = str_remove(name1, "\\.\\d $")) %>%
pivot_wider(names_from = name1, values_from = value,
values_fn = ~ sum(.x, na.rm = TRUE))
-output
# A tibble: 4 × 3
name X1990 X1991
<chr> <dbl> <dbl>
1 name1 22 11
2 name2 37 35
3 name3 22 12
4 name4 20 15
Or in base R
, use split.default
to split the data into a list
of datasets based on the column name pattern, get the rowSums
and cbind
with the first column
cbind(df[1], sapply(split.default(df[-1],
trimws(names(df)[-1], whitespace = "\\.\\d ")), rowSums, na.rm = TRUE))
name X1990 X1991
1 name1 22 11
2 name2 37 35
3 name3 22 12
4 name4 20 15