How to fill missing values grouped on id and based on time period from index date-CodePudding

I want to fill in missing values for a data.frame based on a period of time within groups of ID.

For the latest registration_dat within the same ID group, I want to fill in with previous values in the ID group but only if the registration_dat is within 1 year of the latest registration_dat in the ID group.

Sample version of my data:

ID registration_dat  value1      value2
1  2020-03-04          NA          NA
1  2019-05-06          33          25
1  2019-01-02          32          21
3  2021-10-31          NA          NA
3  2018-10-12          33          NA
3  2018-10-10          25          35
4  2020-01-02          NA          NA
4  2019-10-31          32          83
4  2019-09-20          33          56
8  2019-12-12          NA          NA
8  2019-10-31          NA          43
8  2019-08-12          32          46

Desired output:

ID registration_dat  value1      value2
1  2020-03-04          33          25
1  2019-05-06          33          25
1  2019-01-02          32          21
3  2021-10-31          NA          NA
3  2018-10-12          33          NA
3  2018-10-10          25          35
4  2020-01-02          32          83
4  2019-10-31          32          83
4  2019-09-20          33          56
8  2019-12-12          32          43
8  2019-10-31          NA          43
8  2019-08-12          32          46

I am later filtering the data so that i get one unique ID based on the latest registration date and I want this row to have as little missing data as possible hence I want to do this for all columns in the dataframe. However I do not want NA values being filled in by values in previous dates if its more than 1 year apart from the latest registration date. My dataframe has 14 columns and 3 million rows so I would need it to work on a much bigger data.frame than the one shown as an example.

I'd appreciate any ideas!

CodePudding user response：

You could make a small function (f, below) to handle each value column.

Make a grouped ID, and generate a rowid (this is only to retain your original order)

dat <- dat %>% 
  mutate(rowid = row_number()) %>% 
  arrange(registration_dat) %>% 
  group_by(ID)

Make a function that takes a df and val column, and returns and updated df with val fixed

f <- function(df, val) {
  bind_rows(
    df %>% filter(is.na({{val}}) & row_number()!=n()),
    df %>% filter(!is.na({{val}}) | row_number()==n()) %>% 
      mutate({{val}} := if_else(is.na({{val}}) & registration_dat-lag(registration_dat)<365, lag({{val}}),{{val}}))
  )
}

Apply the function to the columns of interest

dat = f(dat,value1)
dat = f(dat,value2)

If you want, recover the original order

dat %>% arrange(rowid) %>% select(-rowid)

Output:

      ID registration_dat value1 value2
   <int> <date>            <int>  <int>
 1     1 2020-03-04           33     25
 2     1 2019-05-06           33     25
 3     1 2019-01-02           32     21
 4     3 2021-10-31           NA     NA
 5     3 2018-10-12           33     NA
 6     3 2018-10-10           25     35
 7     4 2020-01-02           32     83
 8     4 2019-10-31           32     83
 9     4 2019-09-20           33     56
10     8 2019-12-12           32     46
11     8 2019-10-31           NA     43
12     8 2019-08-12           32     46

CodePudding user response：

You can use across() to manipulate multiple columns at the same time. Note that I use date1 - years(1) <= date2 rather than date1 - 365 <= date2 to identify if a date is within 1 year of the latest one, which can take a leap year (366 days) into account.

library(dplyr)
library(lubridate)

df %>%
  group_by(ID) %>%
  arrange(desc(registration_dat), .by_group = TRUE) %>%
  mutate(across(starts_with("value"),
    ~ if_else(row_number() == 1 & is.na(.x) & registration_dat - years(1) <= registration_dat[which.max(!is.na(.x))],
              .x[which.max(!is.na(.x))], .x))) %>%
  ungroup()

# # A tibble: 12 x 4
#       ID registration_dat value1 value2
#    <int> <date>            <int>  <int>
#  1     1 2020-03-04           33     25
#  2     1 2019-05-06           33     25
#  3     1 2019-01-02           32     21
#  4     3 2021-10-31           NA     NA
#  5     3 2018-10-12           33     NA
#  6     3 2018-10-10           25     35
#  7     4 2020-01-02           32     83
#  8     4 2019-10-31           32     83
#  9     4 2019-09-20           33     56
# 10     8 2019-12-12           32     43
# 11     8 2019-10-31           NA     43
# 12     8 2019-08-12           32     46

Data

df <- structure(list(ID = c(1L, 1L, 1L, 3L, 3L, 3L, 4L, 4L, 4L, 8L,
8L, 8L), registration_dat = structure(c(18325, 18022, 17898,
18931, 17816, 17814, 18263, 18200, 18159, 18242, 18200, 18120
), class = "Date"), value1 = c(NA, 33L, 32L, NA, 33L, 25L, NA,
32L, 33L, NA, NA, 32L), value2 = c(NA, 25L, 21L, NA, NA, 35L,
NA, 83L, 56L, NA, 43L, 46L)), class = "data.frame", row.names = c(NA,-12L))