Home > Software design >  Fill in Missing Dates with Merge in R
Fill in Missing Dates with Merge in R

Time:06-22

df1: has all dates from 2021-04-20 to 2022-05-15

Date
2021-04-20
2021-04-21
2021-04-22
2021-04-23
2021-04-24
2021-04-25
2021-04-26
...
2022-04-12
2022-04-13
2022-04-14
2022-05-15

df2: Random and incomplete list of dates between 2021-04-20 to 2022-05-15

Date Group
2021-04-20 1001
2021-04-21 1001
2021-04-25 1001
2021-04-27 1001
2021-04-30 1001
2021-05-05 1001
2021-06-06 1001
2021-04-20 1002
2021-04-21 1002
2021-04-22 1002
2021-04-23 1002
2021-05-24 1002
2021-05-22 1002
2021-06-23 1002
2021-06-24 1002
2021-07-01 1002
2021-07-12 1002
2021-08-09 1002

dfdesired: Want to fill in incomplete dates in df2 with dates in df1. I want to do this by group. This means 1001 should start at 2021-04-20 and end at 2022-05-15. 1002 should also start at 2021-04-20 and end at 2022-05-15.

Date Group
2021-04-20 1001
2021-04-21 1001
2021-04-22 1001
2021-04-23 1001
2021-04-24 1001
2021-04-25 1001
2021-04-26 1001
... ...
2022-04-12 1001
2022-04-13 1001
2022-04-14 1001
2022-05-15 1001
2021-04-20 1002
2021-04-21 1002
2021-04-22 1002
2021-04-23 1002
2021-04-24 1002
2021-04-25 1002
2021-04-26 1002
... ...
2022-04-12 1002
2022-04-13 1002
2022-04-14 1002
2022-05-15 1002

Not sure how to do this. Any help would be appreciated.

CodePudding user response:

You can do a "blind merge" (cartesian expansion) with base::merge(..., by=NULL). Here, I'm starting with a frame of just unique Group values on a new frame of the full date sequence.

dates <- data.frame(Date = seq(as.Date("2021-04-20"), as.Date("2022-05-15"), by="day"))
out <- merge(unique(dat[,c("Group"), drop = FALSE]), dates, by = NULL)
head(out)
#   Group       Date
# 1  1001 2021-04-20
# 2  1002 2021-04-20
# 3  1001 2021-04-21
# 4  1002 2021-04-21
# 5  1001 2021-04-22
# 6  1002 2021-04-22
tail(out)
#     Group       Date
# 777  1001 2022-05-13
# 778  1002 2022-05-13
# 779  1001 2022-05-14
# 780  1002 2022-05-14
# 781  1001 2022-05-15
# 782  1002 2022-05-15

Data:

dat <- structure(list(Date = c("2021-04-20", "2021-04-21", "2021-04-25", "2021-04-27", "2021-04-30", "2021-05-05", "2021-06-06", "2021-04-20", "2021-04-21", "2021-04-22", "2021-04-23", "2021-05-24", "2021-05-22", "2021-06-23", "2021-06-24", "2021-07-01", "2021-07-12", "2021-08-09"), Group = c(1001L, 1001L, 1001L, 1001L, 1001L, 1001L, 1001L, 1002L, 1002L, 1002L, 1002L, 1002L, 1002L, 1002L, 1002L, 1002L, 1002L, 1002L)), class = "data.frame", row.names = c(NA, -18L))

CodePudding user response:

Here is a dplyr solution:

  1. We use anti_join to get the date that are not in df2.

  2. then we use bind_rows to bind these values to df2

  3. after getting date format with ymd we arrange for year and fill Group column.

  4. some tweaking.

library(dplyr)
library(lubridate)

anti_join(df1, df2) %>% 
  bind_rows(df2) %>% 
  mutate(Date = ymd(Date)) %>% 
  arrange(year(Date)) %>% 
  fill(Group, .direction = "updown") %>% 
  select(1,2) %>% 
  arrange(Group, Date)
              Date Group
1  2021-04-20  1001
2  2021-04-21  1001
3  2021-04-24  1001
4  2021-04-25  1001
5  2021-04-26  1001
6  2021-04-27  1001
7  2021-04-30  1001
8  2021-05-05  1001
9  2021-06-06  1001
10 2021-04-20  1002
11 2021-04-21  1002
12 2021-04-22  1002
13 2021-04-23  1002
14 2021-05-22  1002
15 2021-05-24  1002
16 2021-06-23  1002
17 2021-06-24  1002
18 2021-07-01  1002
19 2021-07-12  1002
20 2021-08-09  1002
21 2022-04-12  1002
22 2022-04-13  1002
23 2022-04-14  1002
24 2022-05-15  1002
  • Related