Home > Enterprise >  Calculating missing values in R iteratively based on values in previous rows in grouped data
Calculating missing values in R iteratively based on values in previous rows in grouped data

Time:10-04

I am trying to perform a calculation that is fairly simple in Excel. But I am having difficulty in figuring out a way to do the same in R.

This is the data:

structure(list(Industry = c("A ", "B", "C", "A ", "B", "C", "A ", 
"B", "C", "A ", "B", "C"), Date = c("06-01-2022", "06-01-2022", 
"06-01-2022", "07-01-2022", "07-01-2022", "07-01-2022", "08-01-2022", 
"08-01-2022", "08-01-2022", "09-01-2022", "09-01-2022", "09-01-2022"
), Value = c(43496, 159927, 42428, 44895, 162891, 43091, NA, 
NA, NA, NA, NA, NA), Growth = c(NA, NA, NA, NA, NA, NA, 0.05, 
0.04, 0.03, 0.01, 0.02, 0.06)), row.names = c(NA, 12L), class = "data.frame")

I want to create the missing values where value in a given month for a group is calculated by multiplying the growth rate with the previous month's value. so for month 8, the value in month 7 is multiplied with growth rate and for month 9, the value of month 8 is multiplied with the growth rate in Growth column.

Output should look like this:

structure(list(Industry = c("A ", "B", "C", "A ", "B", "C", "A ", 
"B", "C", "A ", "B", "C"), Date = c("06-01-2022", "06-01-2022", 
"06-01-2022", "07-01-2022", "07-01-2022", "07-01-2022", "08-01-2022", 
"08-01-2022", "08-01-2022", "09-01-2022", "09-01-2022", "09-01-2022"
), Value = c(43496, 159927, 42428, 44895, 162891, 43091, 47139.75, 
169406.64, 44383.73, 47611.1475, 172794.7728, 47046.7538), Growth = c(NA, 
NA, NA, NA, NA, NA, 0.05, 0.04, 0.03, 0.01, 0.02, 0.06)), row.names = c(NA, 
12L), class = "data.frame")

Is there any neat dplyr/data.table solution for this?

CodePudding user response:

Here is a two-step data.table approach in which we first carry the Growth value backward in time (for convenience) and then make use of Reduce() setting accumulate = TRUE to carry the Value column forward in time:

library(data.table)

setDT(dat)

## carry growth column backward
dat[is.na(Growth), Growth := Value / shift(Value, type = "lag", fill = Value[1]) - 1, by = "Industry"]

## carry value column forward
dat[, Value := Reduce(\(val, gro) (1   gro) * val, x = Growth, init = Value[1], accumulate = TRUE)[-1], by = "Industry"]

#>     Industry       Date     Value     Growth
#>  1:       A  06-01-2022  43496.00 0.00000000
#>  2:        B 06-01-2022 159927.00 0.00000000
#>  3:        C 06-01-2022  42428.00 0.00000000
#>  4:       A  07-01-2022  44895.00 0.03216388
#>  5:        B 07-01-2022 162891.00 0.01853346
#>  6:        C 07-01-2022  43091.00 0.01562647
#>  7:       A  08-01-2022  47139.75 0.05000000
#>  8:        B 08-01-2022 169406.64 0.04000000
#>  9:        C 08-01-2022  44383.73 0.03000000
#> 10:       A  09-01-2022  47611.15 0.01000000
#> 11:        B 09-01-2022 172794.77 0.02000000
#> 12:        C 09-01-2022  47046.75 0.06000000

Data

dat <- structure(list(Industry = c("A ", "B", "C", "A ", "B", "C", "A ", 
"B", "C", "A ", "B", "C"), Date = c("06-01-2022", "06-01-2022", 
"06-01-2022", "07-01-2022", "07-01-2022", "07-01-2022", "08-01-2022", 
"08-01-2022", "08-01-2022", "09-01-2022", "09-01-2022", "09-01-2022"
), Value = c(43496, 159927, 42428, 44895, 162891, 43091, NA, 
NA, NA, NA, NA, NA), Growth = c(NA, NA, NA, NA, NA, NA, 0.05, 
0.04, 0.03, 0.01, 0.02, 0.06)), row.names = c(NA, 12L), class = "data.frame")

CodePudding user response:

Data table answer is way better, I just don't understand the syntax of data.table.

# original data
df <- structure(list(Industry = c("A ", "B", "C", "A ", "B", "C", "A ",
                            "B", "C", "A ", "B", "C"), Date = c("06-01-2022", "06-01-2022",
                                                                "06-01-2022", "07-01-2022", "07-01-2022", "07-01-2022", "08-01-2022",
                                                                "08-01-2022", "08-01-2022", "09-01-2022", "09-01-2022", "09-01-2022"
                            ), Value = c(43496, 159927, 42428, 44895, 162891, 43091, NA,
                                         NA, NA, NA, NA, NA), Growth = c(NA, NA, NA, NA, NA, NA, 0.05,
                                                                         0.04, 0.03, 0.01, 0.02, 0.06)), row.names = c(NA, 12L), class = "data.frame")

# basic cleaning of names, industry, arrangement

df <- df %>% 
  janitor::clean_names() %>%
  mutate(industry = str_trim(industry)) %>%
  arrange(industry, date)  %>%
  group_by(industry)

# create a custom function

replace_na_growth <- function(df){

  warning("must be grouped dataframe")
# count the maximum number of consecutive NA by industry 
  
  count_na <- unique(df %>%
  mutate(n = cumsum(is.na(value))) %>%
  filter(n == max(n)) %>% 
  pull(n))


  df <- df %>%
    mutate(growth = ifelse(is.na(growth), 1, 1   growth))
    
# write a for loop that runs through the max cumulative sum of NA --------

  #' if value is NA, then take value(lag by 1) * it by growth
  #' you have to do this multiple times to get it to keep replacing consecutive na's
  
  for(i in 1:count_na){ 
    
    df<- df %>%
      group_by(industry) %>%
      mutate(value = ifelse(is.na(value), lag(value, 1)*growth, value))
  }
  
return(df)
}
         

df %>%
  replace_na_growth()
  • Related