I have data that look like this, and I need to fill in the missing (NA) values for DATE.
ID DAY TIME DATE
<dbl> <dbl> <dbl> <date>
1 1 1 1 NA
2 1 1 2 NA
3 1 1 3 NA
4 1 1 4 NA
5 1 1 5 NA
6 1 2 1 2021-09-25
7 1 2 2 2021-09-25
8 1 2 3 2021-09-25
9 1 2 4 2021-09-25
10 1 2 5 2021-09-25
11 1 3 1 NA
12 1 3 2 NA
13 1 3 3 NA
14 1 3 4 NA
15 1 3 5 NA
16 2 1 1 2022-02-26
17 2 1 2 2022-02-26
18 2 1 3 2022-02-26
19 2 1 4 2022-02-26
20 2 1 5 2022-02-26
21 2 2 1 NA
22 2 2 2 2022-02-27
23 2 2 3 2022-02-27
24 2 2 4 2022-02-27
25 2 2 5 2022-02-27
The DATE value that corresponds with each DAY is different for each ID. The final dataset should look like this:
# A tibble: 25 × 4
ID DAY TIME DATE
<dbl> <dbl> <dbl> <chr>
1 1 1 1 2021-09-24
2 1 1 2 2021-09-24
3 1 1 3 2021-09-24
4 1 1 4 2021-09-24
5 1 1 5 2021-09-24
6 1 2 1 2021-09-25
7 1 2 2 2021-09-25
8 1 2 3 2021-09-25
9 1 2 4 2021-09-25
10 1 2 5 2021-09-25
11 1 3 1 2021-09-26
12 1 3 2 2021-09-26
13 1 3 3 2021-09-26
14 1 3 4 2021-09-26
15 1 3 5 2021-09-26
16 2 1 1 2022-02-26
17 2 1 2 2022-02-26
18 2 1 3 2022-02-26
19 2 1 4 2022-02-26
20 2 1 5 2022-02-26
21 2 2 1 2022-02-27
22 2 2 2 2022-02-27
23 2 2 3 2022-02-27
24 2 2 4 2022-02-27
25 2 2 5 2022-02-27
CodePudding user response:
One approach is to compute a "zero" (reference) date from the known dates as DATE - DAY, then use fill within each ID, since all rows of a given ID share a common reference date. The final DATE is then the reference date plus DAY.
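library(tidyverse)

df %>%
  mutate(DATE = as.Date(DATE),
         DATE0 = DATE - DAY) %>%           # reference date; NA where DATE is unknown
  group_by(ID) %>%
  fill(DATE0, .direction = "updown") %>%   # spread the known reference date within each ID
  mutate(DATE = DATE0 + DAY) %>%           # rebuild DATE from the reference date plus DAY
  ungroup() %>%
  select(-DATE0)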
Output
ID DAY TIME DATE
<int> <int> <int> <date>
1 1 1 1 2021-09-24
2 1 1 2 2021-09-24
3 1 1 3 2021-09-24
4 1 1 4 2021-09-24
5 1 1 5 2021-09-24
6 1 2 1 2021-09-25
7 1 2 2 2021-09-25
8 1 2 3 2021-09-25
9 1 2 4 2021-09-25
10 1 2 5 2021-09-25
11 1 3 1 2021-09-26
12 1 3 2 2021-09-26
13 1 3 3 2021-09-26
14 1 3 4 2021-09-26
15 1 3 5 2021-09-26
16 2 1 1 2022-02-26
17 2 1 2 2022-02-26
18 2 1 3 2022-02-26
19 2 1 4 2022-02-26
20 2 1 5 2022-02-26
21 2 2 1 2022-02-27
22 2 2 2 2022-02-27
23 2 2 3 2022-02-27
24 2 2 4 2022-02-27
25 2 2 5 2022-02-27
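The same reference-date idea can be sketched in base R without the tidyverse. This is a minimal illustration, not the answer's own code: the small data frame below is made up to mirror the question, and it assumes every ID has at least one known DATE and that the known dates within an ID are consistent with each other.

```r
# Hypothetical miniature of the question's data (one row per ID/DAY)
df <- data.frame(
  ID   = c(1, 1, 1, 2, 2),
  DAY  = c(1, 2, 3, 1, 2),
  DATE = as.Date(c(NA, "2021-09-25", NA, "2022-02-26", NA))
)

# Reference ("day zero") date as a day count; NA where DATE is unknown
date0 <- as.numeric(df$DATE) - df$DAY

# Within each ID, copy the first known reference date to every row
date0 <- ave(date0, df$ID, FUN = function(x) x[!is.na(x)][1])

# Rebuild DATE from the reference date plus DAY
df$DATE <- as.Date(date0 + df$DAY, origin = "1970-01-01")
df
```

If an ID has no known DATE at all, its reference date stays NA and so do its filled dates, which is usually the safest outcome.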
Data
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
DAY = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), TIME = c(1L,
2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L,
2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L), DATE = c(NA, NA, NA,
NA, NA, "2021-09-25", "2021-09-25", "2021-09-25", "2021-09-25",
"2021-09-25", NA, NA, NA, NA, NA, "2022-02-26", "2022-02-26",
"2022-02-26", "2022-02-26", "2022-02-26", NA, "2022-02-27",
"2022-02-27", "2022-02-27", "2022-02-27")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24",
"25"))
CodePudding user response:
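Here's a cheeky way to do it with lm. It works here because DATE increases by exactly one day per DAY, with an offset that depends on ID, so a linear model fits the known dates perfectly. DATE is converted to a number of days first, and the predictions are rounded to guard against floating-point error.
fit <- lm(as.numeric(as.Date(DATE)) ~ DAY + ID, data = dat)
dat$dateFilled <- as.Date(round(predict(fit, newdata = dat)), origin = "1970-01-01")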
> dat
ID DAY TIME DATE dateFilled
1 1 1 1 <NA> 2021-09-24
2 1 1 2 <NA> 2021-09-24
3 1 1 3 <NA> 2021-09-24
4 1 1 4 <NA> 2021-09-24
5 1 1 5 <NA> 2021-09-24
6 1 2 1 2021-09-25 2021-09-25
7 1 2 2 2021-09-25 2021-09-25
8 1 2 3 2021-09-25 2021-09-25
9 1 2 4 2021-09-25 2021-09-25
10 1 2 5 2021-09-25 2021-09-25
11 1 3 1 <NA> 2021-09-26
12 1 3 2 <NA> 2021-09-26
13 1 3 3 <NA> 2021-09-26
14 1 3 4 <NA> 2021-09-26
15 1 3 5 <NA> 2021-09-26
16 2 1 1 2022-02-26 2022-02-26
17 2 1 2 2022-02-26 2022-02-26
18 2 1 3 2022-02-26 2022-02-26
19 2 1 4 2022-02-26 2022-02-26
20 2 1 5 2022-02-26 2022-02-26
21 2 2 1 <NA> 2022-02-27
22 2 2 2 2022-02-27 2022-02-27
23 2 2 3 2022-02-27 2022-02-27
24 2 2 4 2022-02-27 2022-02-27
25 2 2 5 2022-02-27 2022-02-27
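One caveat with the lm approach: with a numeric ID the model fits a single slope across IDs, which is only guaranteed to be exact when there are two IDs. A hedged sketch of a variant that treats ID as a factor, so each ID gets its own offset, is shown below. The three-ID data frame and the DATE_NUM column are made up for illustration, and the sketch assumes every ID has at least one known DATE and that DAY varies among the known rows.

```r
# Hypothetical data with three IDs whose start dates are unevenly spaced
dat <- data.frame(
  ID   = c(1, 1, 2, 2, 3, 3),
  DAY  = c(1, 2, 1, 2, 1, 2),
  DATE = as.Date(c(NA, "2021-09-25", "2022-02-26", "2022-02-27",
                   "2023-01-10", NA))
)

# Dates as day counts so they can serve as a numeric response
dat$DATE_NUM <- as.numeric(dat$DATE)

# factor(ID) gives each ID its own intercept; rows with NA DATE are dropped from the fit
fit <- lm(DATE_NUM ~ DAY + factor(ID), data = dat)

# Predict for all rows, round away floating-point error, convert back to Date
dat$dateFilled <- as.Date(round(predict(fit, newdata = dat)),
                          origin = "1970-01-01")
dat
```

The fill-based approach is generally the more robust choice, since it does not rely on a regression being identifiable from the known rows.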