I am trying to perform a calculation that is fairly simple in Excel. But I am having difficulty in figuring out a way to do the same in R.
This is the data:
structure(list(Industry = c("A ", "B", "C", "A ", "B", "C", "A ",
"B", "C", "A ", "B", "C"), Date = c("06-01-2022", "06-01-2022",
"06-01-2022", "07-01-2022", "07-01-2022", "07-01-2022", "08-01-2022",
"08-01-2022", "08-01-2022", "09-01-2022", "09-01-2022", "09-01-2022"
), Value = c(43496, 159927, 42428, 44895, 162891, 43091, NA,
NA, NA, NA, NA, NA), Growth = c(NA, NA, NA, NA, NA, NA, 0.05,
0.04, 0.03, 0.01, 0.02, 0.06)), row.names = c(NA, 12L), class = "data.frame")
I want to create the missing values where value in a given month for a group is calculated by multiplying the growth rate with the previous month's value. so for month 8, the value in month 7 is multiplied with growth rate and for month 9, the value of month 8 is multiplied with the growth rate in Growth column.
Output should look like this:
structure(list(Industry = c("A ", "B", "C", "A ", "B", "C", "A ",
"B", "C", "A ", "B", "C"), Date = c("06-01-2022", "06-01-2022",
"06-01-2022", "07-01-2022", "07-01-2022", "07-01-2022", "08-01-2022",
"08-01-2022", "08-01-2022", "09-01-2022", "09-01-2022", "09-01-2022"
), Value = c(43496, 159927, 42428, 44895, 162891, 43091, 47139.75,
169406.64, 44383.73, 47611.1475, 172794.7728, 47046.7538), Growth = c(NA,
NA, NA, NA, NA, NA, 0.05, 0.04, 0.03, 0.01, 0.02, 0.06)), row.names = c(NA,
12L), class = "data.frame")
Is there any neat dplyr/data.table solution for this?
CodePudding user response:
Here is a two-step data.table
approach in which we first carry the Growth
value backward in time (for convenience) and then make use of Reduce()
setting accumulate = TRUE
to carry the Value
column forward in time:
library(data.table)
setDT(dat)
## carry growth column backward
dat[is.na(Growth), Growth := Value / shift(Value, type = "lag", fill = Value[1]) - 1, by = "Industry"]
## carry value column forward
dat[, Value := Reduce(\(val, gro) (1 gro) * val, x = Growth, init = Value[1], accumulate = TRUE)[-1], by = "Industry"]
#> Industry Date Value Growth
#> 1: A 06-01-2022 43496.00 0.00000000
#> 2: B 06-01-2022 159927.00 0.00000000
#> 3: C 06-01-2022 42428.00 0.00000000
#> 4: A 07-01-2022 44895.00 0.03216388
#> 5: B 07-01-2022 162891.00 0.01853346
#> 6: C 07-01-2022 43091.00 0.01562647
#> 7: A 08-01-2022 47139.75 0.05000000
#> 8: B 08-01-2022 169406.64 0.04000000
#> 9: C 08-01-2022 44383.73 0.03000000
#> 10: A 09-01-2022 47611.15 0.01000000
#> 11: B 09-01-2022 172794.77 0.02000000
#> 12: C 09-01-2022 47046.75 0.06000000
Data
dat <- structure(list(Industry = c("A ", "B", "C", "A ", "B", "C", "A ",
"B", "C", "A ", "B", "C"), Date = c("06-01-2022", "06-01-2022",
"06-01-2022", "07-01-2022", "07-01-2022", "07-01-2022", "08-01-2022",
"08-01-2022", "08-01-2022", "09-01-2022", "09-01-2022", "09-01-2022"
), Value = c(43496, 159927, 42428, 44895, 162891, 43091, NA,
NA, NA, NA, NA, NA), Growth = c(NA, NA, NA, NA, NA, NA, 0.05,
0.04, 0.03, 0.01, 0.02, 0.06)), row.names = c(NA, 12L), class = "data.frame")
CodePudding user response:
Data table answer is way better, I just don't understand the syntax of data.table.
# original data
df <- structure(list(Industry = c("A ", "B", "C", "A ", "B", "C", "A ",
"B", "C", "A ", "B", "C"), Date = c("06-01-2022", "06-01-2022",
"06-01-2022", "07-01-2022", "07-01-2022", "07-01-2022", "08-01-2022",
"08-01-2022", "08-01-2022", "09-01-2022", "09-01-2022", "09-01-2022"
), Value = c(43496, 159927, 42428, 44895, 162891, 43091, NA,
NA, NA, NA, NA, NA), Growth = c(NA, NA, NA, NA, NA, NA, 0.05,
0.04, 0.03, 0.01, 0.02, 0.06)), row.names = c(NA, 12L), class = "data.frame")
# basic cleaning of names, industry, arrangement
df <- df %>%
janitor::clean_names() %>%
mutate(industry = str_trim(industry)) %>%
arrange(industry, date) %>%
group_by(industry)
# create a custom function
replace_na_growth <- function(df){
warning("must be grouped dataframe")
# count the maximum number of consecutive NA by industry
count_na <- unique(df %>%
mutate(n = cumsum(is.na(value))) %>%
filter(n == max(n)) %>%
pull(n))
df <- df %>%
mutate(growth = ifelse(is.na(growth), 1, 1 growth))
# write a for loop that runs through the max cumulative sum of NA --------
#' if value is NA, then take value(lag by 1) * it by growth
#' you have to do this multiple times to get it to keep replacing consecutive na's
for(i in 1:count_na){
df<- df %>%
group_by(industry) %>%
mutate(value = ifelse(is.na(value), lag(value, 1)*growth, value))
}
return(df)
}
df %>%
replace_na_growth()