How to fill in date using dplyr?

Time:04-13

I have data that look like this, and I need to fill in the missing NA values for DATE.

      ID   DAY  TIME DATE      
   <dbl> <dbl> <dbl> <date>    
 1     1     1     1 NA        
 2     1     1     2 NA        
 3     1     1     3 NA        
 4     1     1     4 NA        
 5     1     1     5 NA        
 6     1     2     1 2021-09-25
 7     1     2     2 2021-09-25
 8     1     2     3 2021-09-25
 9     1     2     4 2021-09-25
10     1     2     5 2021-09-25
11     1     3     1 NA        
12     1     3     2 NA        
13     1     3     3 NA        
14     1     3     4 NA        
15     1     3     5 NA        
16     2     1     1 2022-02-26
17     2     1     2 2022-02-26
18     2     1     3 2022-02-26
19     2     1     4 2022-02-26
20     2     1     5 2022-02-26
21     2     2     1 NA        
22     2     2     2 2022-02-27
23     2     2     3 2022-02-27
24     2     2     4 2022-02-27
25     2     2     5 2022-02-27

The DATE value that corresponds with DAY is different for each ID. The final dataset should look like this:

# A tibble: 25 × 4
      ID   DAY  TIME DATE      
   <dbl> <dbl> <dbl> <chr>     
 1     1     1     1 2021-09-24
 2     1     1     2 2021-09-24
 3     1     1     3 2021-09-24
 4     1     1     4 2021-09-24
 5     1     1     5 2021-09-24
 6     1     2     1 2021-09-25
 7     1     2     2 2021-09-25
 8     1     2     3 2021-09-25
 9     1     2     4 2021-09-25
10     1     2     5 2021-09-25
11     1     3     1 2021-09-26
12     1     3     2 2021-09-26
13     1     3     3 2021-09-26
14     1     3     4 2021-09-26
15     1     3     5 2021-09-26
16     2     1     1 2022-02-26
17     2     1     2 2022-02-26
18     2     1     3 2022-02-26
19     2     1     4 2022-02-26
20     2     1     5 2022-02-26
21     2     2     1 2022-02-27
22     2     2     2 2022-02-27
23     2     2     3 2022-02-27
24     2     2     4 2022-02-27
25     2     2     5 2022-02-27

CodePudding user response:

One approach is to compute a "zero" reference date (DATE minus DAY) from the rows where DATE is known, then use fill to propagate it to every row of the same ID, since all rows within an ID share that reference date. The final DATE is then the reference date plus DAY.

library(tidyverse)

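df %>%
  # DATE is stored as character in the sample data, so convert it first;
  # DATE0 is the "day zero" date implied by each known DATE
  mutate(DATE = as.Date(DATE),
         DATE0 = DATE - DAY) %>%
  group_by(ID) %>%
  # copy the known reference date to every row of the same ID
  fill(DATE0, .direction = "updown") %>%
  # rebuild DATE from the reference date and the day number
  mutate(DATE = DATE0 + DAY) %>%
  select(-DATE0)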

Output

      ID   DAY  TIME DATE      
   <int> <int> <int> <date>    
 1     1     1     1 2021-09-24
 2     1     1     2 2021-09-24
 3     1     1     3 2021-09-24
 4     1     1     4 2021-09-24
 5     1     1     5 2021-09-24
 6     1     2     1 2021-09-25
 7     1     2     2 2021-09-25
 8     1     2     3 2021-09-25
 9     1     2     4 2021-09-25
10     1     2     5 2021-09-25
11     1     3     1 2021-09-26
12     1     3     2 2021-09-26
13     1     3     3 2021-09-26
14     1     3     4 2021-09-26
15     1     3     5 2021-09-26
16     2     1     1 2022-02-26
17     2     1     2 2022-02-26
18     2     1     3 2022-02-26
19     2     1     4 2022-02-26
20     2     1     5 2022-02-26
21     2     2     1 2022-02-27
22     2     2     2 2022-02-27
23     2     2     3 2022-02-27
24     2     2     4 2022-02-27
25     2     2     5 2022-02-27
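
The reference-date arithmetic can also be checked by hand on a smaller example. The following is only a sketch in base R, not part of the answer above: the data frame `toy` and the column `DATE_FILLED` are made up for illustration, and it assumes every known DATE within an ID implies the same reference date.

```r
# Hypothetical toy data: at least one known DATE per ID, the rest missing
toy <- data.frame(
  ID   = c(1L, 1L, 1L, 2L, 2L),
  DAY  = c(1L, 2L, 3L, 1L, 2L),
  DATE = as.Date(c(NA, "2021-09-25", NA, "2022-02-26", NA))
)

# "Day zero" as a number of days since 1970-01-01; NA where DATE is unknown
day0 <- as.numeric(toy$DATE) - toy$DAY

# Within each ID, repeat the first known day-zero value on every row
# (a base R stand-in for group_by(ID) plus fill(..., .direction = "updown"))
day0 <- ave(day0, toy$ID,
            FUN = function(x) rep(x[!is.na(x)][1], length(x)))

# Rebuild the date from day zero plus the day number
toy$DATE_FILLED <- as.Date(day0 + toy$DAY, origin = "1970-01-01")

print(toy)
```

For ID 1, the known date 2021-09-25 on DAY 2 gives a day zero of 2021-09-23, so DAY 1 and DAY 3 become 2021-09-24 and 2021-09-26.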

Data

df <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), 
    DAY = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 
    3L, 3L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), TIME = c(1L, 
    2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 
    2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L), DATE = c(NA, NA, NA, 
    NA, NA, "2021-09-25", "2021-09-25", "2021-09-25", "2021-09-25", 
    "2021-09-25", NA, NA, NA, NA, NA, "2022-02-26", "2022-02-26", 
    "2022-02-26", "2022-02-26", "2022-02-26", NA, "2022-02-27", 
    "2022-02-27", "2022-02-27", "2022-02-27")), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", 
"25"))

CodePudding user response:

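Here's a cheeky way to do it with lm. This assumes dat holds the same data as above, with DATE stored as a Date rather than as character.

dat$dateFilled <- as.Date(predict(lm(DATE ~ DAY + ID, dat), newdata = dat), origin = "1970-01-01")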

> dat
   ID DAY TIME       DATE dateFilled
1   1   1    1       <NA> 2021-09-24
2   1   1    2       <NA> 2021-09-24
3   1   1    3       <NA> 2021-09-24
4   1   1    4       <NA> 2021-09-24
5   1   1    5       <NA> 2021-09-24
6   1   2    1 2021-09-25 2021-09-25
7   1   2    2 2021-09-25 2021-09-25
8   1   2    3 2021-09-25 2021-09-25
9   1   2    4 2021-09-25 2021-09-25
10  1   2    5 2021-09-25 2021-09-25
11  1   3    1       <NA> 2021-09-26
12  1   3    2       <NA> 2021-09-26
13  1   3    3       <NA> 2021-09-26
14  1   3    4       <NA> 2021-09-26
15  1   3    5       <NA> 2021-09-26
16  2   1    1 2022-02-26 2022-02-26
17  2   1    2 2022-02-26 2022-02-26
18  2   1    3 2022-02-26 2022-02-26
19  2   1    4 2022-02-26 2022-02-26
20  2   1    5 2022-02-26 2022-02-26
21  2   2    1       <NA> 2022-02-27
22  2   2    2 2022-02-27 2022-02-27
23  2   2    3 2022-02-27 2022-02-27
24  2   2    4 2022-02-27 2022-02-27
25  2   2    5 2022-02-27 2022-02-27
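
The lm trick works because DATE is an exact linear function of DAY (slope of one day) plus a per-ID offset, so the fitted model reproduces the known dates and extrapolates to the missing ones. The sketch below is a hedged variant rather than the answer's own code: `toy2` and `DATE_NUM` are hypothetical names, `factor(ID)` is swapped in because a numeric ID only guarantees an exact fit when there are two IDs, and `round()` is added because a prediction a hair below a whole number of days may otherwise print as the previous day.

```r
# Hypothetical toy data; ID 1 supplies two known dates so the DAY slope is identified
toy2 <- data.frame(
  ID   = c(1L, 1L, 1L, 2L, 2L),
  DAY  = c(1L, 2L, 3L, 1L, 2L),
  DATE = as.Date(c(NA, "2021-09-25", "2021-09-26", "2022-02-26", NA))
)

# Fit on the numeric day count; rows with a missing DATE are dropped by lm
toy2$DATE_NUM <- as.numeric(toy2$DATE)
fit <- lm(DATE_NUM ~ DAY + factor(ID), data = toy2)

# Predict for every row, round to whole days, and convert back to Date
pred <- round(predict(fit, newdata = toy2))
toy2$dateFilled <- as.Date(pred, origin = "1970-01-01")

print(toy2[, c("ID", "DAY", "DATE", "dateFilled")])
```

Note that the model needs enough known dates to estimate the slope and one offset per ID; an ID with no known DATE at all cannot be predicted.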