Home > database >  Add column to data frame with multiple conditions and formulas based on earlier rows
Add column to data frame with multiple conditions and formulas based on earlier rows

Time:10-14

So I have data wrangling question.

Here is my example data.

year <- c(2019, 2019, 2020, 2020, 2021, 2021, 2021)
sn <- c("AB1001", "AB1002", "AB1001", "DC1001", "AB1002", "AB1001", "AB1003")
hours <- c(150, 173, 189, 102, 175, 215, 98)
delta_hours <- c(150, 173, 39, 102, NA, NA, NA)
df <- data.frame(year, sn, hours, delta_hours)

which produces this data frame:

year     sn hours delta_hours
1 2019 AB1001   150         150
2 2019 AB1002   173         173
3 2020 AB1001   189          39
4 2020 DC1001   102         102
5 2021 AB1002   175          NA
6 2021 AB1001   215          NA
7 2021 AB1003    98          NA

What I am needing to do is mutate/update the delta_hours column for 2021. Specifically, for sn's that are unique (haven't been in previous years) I want to be able to simply copy over the information from the hours column. For those sn's that have appeared in previous years, I want to subtract the hours from the 2021 year by the hours in whatever the most recent year was. So for sn AB1001, I would want to take 215 - 189 in order to get 26. For sn AB1003 I want to simply copy over the 98. And of course for any year prior to 2021, I wan to simply keep the information.

My final data frame should basically look like this:

year     sn hours delta_hours
1 2019 AB1001   150         150
2 2019 AB1002   173         173
3 2020 AB1001   189          39
4 2020 DC1001   102         102
5 2021 AB1002   175           2
6 2021 AB1001   215          26
7 2021 AB1003    98          98

I'm assuming I need to use case_when, and I can do that to get the data from prior years to simply copy over, but I'm not getting how to do the subtraction properly.

Help appreciated! Thanks!

CodePudding user response:

library(dplyr)

df %>% 
  group_by(sn) %>% 
  mutate(
    delta_hours = if_else(year == 2021, hours - lag(hours), delta_hours),
    delta_hours = if_else(is.na(delta_hours), hours, delta_hours)
  )

# A tibble: 7 x 4
# Groups:   sn [4]
   year sn     hours delta_hours
  <dbl> <chr>  <dbl>       <dbl>
1  2019 AB1001   150         150
2  2019 AB1002   173         173
3  2020 AB1001   189          39
4  2020 DC1001   102         102
5  2021 AB1002   175           2
6  2021 AB1001   215          26
7  2021 AB1003    98          98
  • Related