I have a dataframe where there are many, many NA values interspersed among the columns of interest. Here is a sample:
dput(try[, 30:34])
structure(list(adj_avg_2016 = c(25641966.4, 28919805, 44152549.8,
4341304.9, 10588244.8, 4928398.6, 8985362.4, 832000, 2674875.2,
20416846.8), adj_avg_2017 = c(27693728, 29815155, 46187854.8,
4594589.9, 11021124.8, 5075010, 9436240.8, 848640, 2782821.6,
21439365.4), adj_avg_2018 = c(31554678.4, 30710505, 45952664,
NA, NA, NA, NA, 865920, 2894427.2, 21556703.6), adj_avg_2019 = c(33454832,
31605855, 45590832, NA, 12319764.8, NA, NA, 900480, NA,
21556703.6), adj_avg_2020 = c(35122520, 32501205, 46495412, NA,
NA, NA, NA, NA, 3130445.6, 22520553.1)), class = "data.frame", row.names = c(NA,
10L))
What I would like to do is sum each of the columns in the dataframe try
based on a condition using the column index.
More specifically, I would like to look at the previous year column and sum using only the elements that are non-NA in both the current and previous year column. For example
- For col
adj_avg_2016
, I want to sum all elements because it is the first column in the group - For col
adj_avg_2017
, I want to sum all elements because there are no NA values in it or in theadj_avg_2016
column. - For col
adj_avg_2018
, I want to sum using only the corresponding entries that aren'tNA
in eitheradj_avg_2017
oradj_avg_2018
(so entries 1:3, 8:10) - For col
adj_avg_2019
, I want to sum using entries 1:3, 8 & 10 because entry 5 inadj_avg_2018
is NA and entry 9 inadj_avg_2019
is NA - For col
adj_avg_2020
I would like to sum using only entries 1:3 & 10 because entry 9 is empty inadj_avg_2019
.
My desired result are columns for each year, agg_pass1_2016
...agg_pass1_2020
containing the specified calculations.
Since the dataframe will continue to grow each year and I will be doing this quite a bit a function seems like the way to go. Here is what I came up with so far:
str <- function(df, overwrite = TRUE){
nms <- grep("adj_avg_[0-9] $", names(df), value = TRUE) # get the columns
nms_col_idx <- grep("adj_avg_[0-9] $", names(df), value = FALSE) # get the column indexes
revnms <- gsub("adj_avg", "agg_pass1", nms) # names of new columns I want to add
if(!overwrite) revnms <- setdiff(revnms, names(df))
df[, revnms] <- lapply(df[, nms], sum, ....) # This is where I get confused.
}
My problems/questions:
What is the best way to incorporate the row and column indexes into
df[, revnms]
in the above function to achieve the desired calculations? I know one way would involve some combination ofwhich(!is.na(.))
but am unsure of how to put that into my function in the last row.Is there another function that could achieve my results as quickly and simply as a function?
All comments/suggestions are welcome. Thanks a lot.
CodePudding user response:
Based on the logic, we may need to multiply with the logical matrix created with the subset of data without the first column and the last column to and get the colSums
colSums(cbind(try[1], try[-1] * (!is.na(try[-1]) &
!is.na(try[-ncol(try)]))), na.rm = TRUE)
-output
adj_avg_2016 adj_avg_2017 adj_avg_2018 adj_avg_2019 adj_avg_2020
151481354 158894530 133534898 133108703 136639690
Or using tidyverse
library(dplyr)
library(stringr)
try %>%
summarise(across(everything(),
~ if(cur_column() == 'adj_avg_2016') sum(., na.rm = TRUE)
else sum(.x[!is.na(.x) &
!is.na(get(str_replace(cur_column(), "(\\d )$",
function(x) as.numeric(x) - 1)))])))
adj_avg_2016 adj_avg_2017 adj_avg_2018 adj_avg_2019 adj_avg_2020
1 151481354 158894530 133534898 133108703 136639690
If we want to create new columns, use mutate
instead of summarise
and also make use of .names
try %>%
mutate(across(everything(),
~ if(cur_column() == 'adj_avg_2016') sum(., na.rm = TRUE)
else sum(.x[!is.na(.x) &
!is.na(get(str_replace(cur_column(), "(\\d )$",
function(x) as.numeric(x) - 1)))]),
.names = "{str_replace(.col, 'avg', 'pass')}"))
-output
adj_avg_2016 adj_avg_2017 adj_avg_2018 adj_avg_2019 adj_avg_2020 adj_pass_2016 adj_pass_2017 adj_pass_2018 adj_pass_2019 adj_pass_2020
1 25641966 27693728 31554678 33454832 35122520 151481354 158894530 133534898 133108703 136639690
2 28919805 29815155 30710505 31605855 32501205 151481354 158894530 133534898 133108703 136639690
3 44152550 46187855 45952664 45590832 46495412 151481354 158894530 133534898 133108703 136639690
4 4341305 4594590 NA NA NA 151481354 158894530 133534898 133108703 136639690
5 10588245 11021125 NA 12319765 NA 151481354 158894530 133534898 133108703 136639690
6 4928399 5075010 NA NA NA 151481354 158894530 133534898 133108703 136639690
7 8985362 9436241 NA NA NA 151481354 158894530 133534898 133108703 136639690
8 832000 848640 865920 900480 NA 151481354 158894530 133534898 133108703 136639690
9 2674875 2782822 2894427 NA 3130446 151481354 158894530 133534898 133108703 136639690
10 20416847 21439365 21556704 21556704 22520553 151481354 158894530 133534898 133108703 136639690