Home > other >  How to fill missing values grouped on id and based on time period from index date
How to fill missing values grouped on id and based on time period from index date

Time:04-19

I want to fill in missing values for a data.frame based on a period of time within groups of ID.

For the latest registration_dat within the same ID group, I want to fill in with previous values in the ID group but only if the registration_dat is within 1 year of the latest registration_dat in the ID group.

Sample version of my data:

ID registration_dat  value1      value2
1  2020-03-04          NA          NA
1  2019-05-06          33          25
1  2019-01-02          32          21
3  2021-10-31          NA          NA
3  2018-10-12          33          NA
3  2018-10-10          25          35
4  2020-01-02          NA          NA
4  2019-10-31          32          83
4  2019-09-20          33          56
8  2019-12-12          NA          NA
8  2019-10-31          NA          43
8  2019-08-12          32          46

Desired output:

ID registration_dat  value1      value2
1  2020-03-04          33          25
1  2019-05-06          33          25
1  2019-01-02          32          21
3  2021-10-31          NA          NA
3  2018-10-12          33          NA
3  2018-10-10          25          35
4  2020-01-02          32          83
4  2019-10-31          32          83
4  2019-09-20          33          56
8  2019-12-12          32          43
8  2019-10-31          NA          43
8  2019-08-12          32          46

I am later filtering the data so that i get one unique ID based on the latest registration date and I want this row to have as little missing data as possible hence I want to do this for all columns in the dataframe. However I do not want NA values being filled in by values in previous dates if its more than 1 year apart from the latest registration date. My dataframe has 14 columns and 3 million rows so I would need it to work on a much bigger data.frame than the one shown as an example.

I'd appreciate any ideas!

CodePudding user response:

You could make a small function (f, below) to handle each value column.

  1. Make a grouped ID, and generate a rowid (this is only to retain your original order)
dat <- dat %>% 
  mutate(rowid = row_number()) %>% 
  arrange(registration_dat) %>% 
  group_by(ID)
  1. Make a function that takes a df and val column, and returns and updated df with val fixed
f <- function(df, val) {
  bind_rows(
    df %>% filter(is.na({{val}}) & row_number()!=n()),
    df %>% filter(!is.na({{val}}) | row_number()==n()) %>% 
      mutate({{val}} := if_else(is.na({{val}}) & registration_dat-lag(registration_dat)<365, lag({{val}}),{{val}}))
  )
}
  1. Apply the function to the columns of interest
dat = f(dat,value1)
dat = f(dat,value2)
  1. If you want, recover the original order
dat %>% arrange(rowid) %>% select(-rowid)

Output:

      ID registration_dat value1 value2
   <int> <date>            <int>  <int>
 1     1 2020-03-04           33     25
 2     1 2019-05-06           33     25
 3     1 2019-01-02           32     21
 4     3 2021-10-31           NA     NA
 5     3 2018-10-12           33     NA
 6     3 2018-10-10           25     35
 7     4 2020-01-02           32     83
 8     4 2019-10-31           32     83
 9     4 2019-09-20           33     56
10     8 2019-12-12           32     46
11     8 2019-10-31           NA     43
12     8 2019-08-12           32     46

CodePudding user response:

You can use across() to manipulate multiple columns at the same time. Note that I use date1 - years(1) <= date2 rather than date1 - 365 <= date2 to identify if a date is within 1 year of the latest one, which can take a leap year (366 days) into account.

library(dplyr)
library(lubridate)

df %>%
  group_by(ID) %>%
  arrange(desc(registration_dat), .by_group = TRUE) %>%
  mutate(across(starts_with("value"),
    ~ if_else(row_number() == 1 & is.na(.x) & registration_dat - years(1) <= registration_dat[which.max(!is.na(.x))],
              .x[which.max(!is.na(.x))], .x))) %>%
  ungroup()

# # A tibble: 12 x 4
#       ID registration_dat value1 value2
#    <int> <date>            <int>  <int>
#  1     1 2020-03-04           33     25
#  2     1 2019-05-06           33     25
#  3     1 2019-01-02           32     21
#  4     3 2021-10-31           NA     NA
#  5     3 2018-10-12           33     NA
#  6     3 2018-10-10           25     35
#  7     4 2020-01-02           32     83
#  8     4 2019-10-31           32     83
#  9     4 2019-09-20           33     56
# 10     8 2019-12-12           32     43
# 11     8 2019-10-31           NA     43
# 12     8 2019-08-12           32     46

Data
df <- structure(list(ID = c(1L, 1L, 1L, 3L, 3L, 3L, 4L, 4L, 4L, 8L,
8L, 8L), registration_dat = structure(c(18325, 18022, 17898,
18931, 17816, 17814, 18263, 18200, 18159, 18242, 18200, 18120
), class = "Date"), value1 = c(NA, 33L, 32L, NA, 33L, 25L, NA,
32L, 33L, NA, NA, 32L), value2 = c(NA, 25L, 21L, NA, NA, 35L,
NA, 83L, 56L, NA, 43L, 46L)), class = "data.frame", row.names = c(NA,-12L))
  • Related