Impute missing values with a value from previous month (if exists)-CodePudding

I have a dataframe with more than 100 000 rows and 30 000 unique ids.

My aim is to fill all the NAs among the different columns if there is a value from the previous month and the same id. However, most of the times the previous recorded value is from more than a month ago. Those NAs I would like to leave untouched.

The id column and the date column do not have NAs. Here is an example of the data I have:

df3

id    oxygen      gluco        dias         bp        date
1   0,25897842  0,20201604  0,17955655  0,14100962  31.7.2019
2   NA          NA          0,38582622  0,12918231  31.12.2014
2   0,35817147  0,32943499  NA          0,43667462  30.11.2018
2   0,68557053  0,42898807  0,93897514  NA          31.10.2018
2   NA          NA          0,99899076  0,44168223  31.7.2018
2   0,43848054  0,38604586  NA          NA          30.4.2013
2   0,15823254  0,06216771  0,07829624  0,69755251  31.1.2016
2   NA          NA          0,61645303  NA          29.2.2016
2   0,94671363  0,50682091  0,96770222  0,97403356  31.5.2018
3   NA          0,77352235  0,660479    0,11554399  30.4.2019
3   0,15567703  NA          0,4553325   NA          31.3.2017
3   NA          NA          0,22181609  0,08527658  30.9.2017
3   0,93660763  NA          NA          NA          31.3.2018
3   0,73416759  NA          NA          0,78501791  30.11.2018
3   NA          NA          NA          NA          28.2.2019
3   0,84525106  0,54360374  NA          0,40595426  31.8.2014
3   0,76221263  0,62983336  0,84592719  0,10640734  31.8.2013
4   NA          0,29108942  0,3863479   NA          31.1.2018
4   0,74075742  NA          0,38117415  0,58849266  30.11.2018
4   0,09400641  0,68860814  NA          0,88895224  31.8.2014
4   0,72202944  0,49901387  0,19967415  NA          31.8.2018
4   0,98205262  0,85213969  0,34450998  0,98962306  30.11.2013

This is the last code implementation that I have tried:

´´´

df3 %>%
    group_by(id) %>%
    mutate_all(funs(na.locf(., na.rm = FALSE, maxgap = 30)))

´´´

But apparently "mutate_all() ignored the following grouping variables: Column id"

CodePudding user response：

You can use the tidyverse for that. Here's an approach:

Change the date column to class Date, then order by date
Prepare the dates and remove the days in Ym
get the time difference in mo
flag the rows which have max one month difference
get groups by cumsum the inverse logic in flag
fill the rows from the same groups

library(dplyr)
library(tidyr)
library(lubridate)

df$date <- as.Date(df$date, format="%d.%m.%Y")

df %>% 
  arrange(date) %>%
  mutate(
     Ym = ym(strftime(date, "%Y-%m")), 
     mo = interval(Ym, lag(Ym, default=as.Date("1970-01-01"))) / months(1), 
     flag = cumsum(!(mo > -2 & mo < 1))) %>% 
  group_by(id, flag) %>% 
  fill(names(.), .direction="down") %>% 
  ungroup() %>% 
  select(-c("Ym","mo","flag")) %>% 
  print(n=nrow(.))

Output

# A tibble: 22 × 6
      id oxygen     gluco      dias       bp         date      
   <int> <chr>      <chr>      <chr>      <chr>      <date>    
 1     2 0,43848054 0,38604586 NA         NA         2013-04-30
 2     3 0,76221263 0,62983336 0,84592719 0,10640734 2013-08-31
 3     4 0,98205262 0,85213969 0,34450998 0,98962306 2013-11-30
 4     3 0,84525106 0,54360374 NA         0,40595426 2014-08-31
 5     4 0,09400641 0,68860814 NA         0,88895224 2014-08-31
 6     2 NA         NA         0,38582622 0,12918231 2014-12-31
 7     2 0,15823254 0,06216771 0,07829624 0,69755251 2016-01-31
 8     2 0,15823254 0,06216771 0,61645303 0,69755251 2016-02-29
 9     3 0,15567703 NA         0,4553325  NA         2017-03-31
10     3 NA         NA         0,22181609 0,08527658 2017-09-30
11     4 NA         0,29108942 0,3863479  NA         2018-01-31
12     3 0,93660763 NA         NA         NA         2018-03-31
13     2 0,94671363 0,50682091 0,96770222 0,97403356 2018-05-31
14     2 NA         NA         0,99899076 0,44168223 2018-07-31
15     4 0,72202944 0,49901387 0,19967415 NA         2018-08-31
16     2 0,68557053 0,42898807 0,93897514 NA         2018-10-31
17     2 0,35817147 0,32943499 0,93897514 0,43667462 2018-11-30
18     3 0,73416759 NA         NA         0,78501791 2018-11-30
19     4 0,74075742 NA         0,38117415 0,58849266 2018-11-30
20     3 NA         NA         NA         NA         2019-02-28
21     3 NA         0,77352235 0,660479   0,11554399 2019-04-30
22     1 0,25897842 0,20201604 0,17955655 0,14100962 2019-07-31

Data

df <- structure(list(id = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L), oxygen = c("0,25897842", 
NA, "0,35817147", "0,68557053", NA, "0,43848054", "0,15823254", 
NA, "0,94671363", NA, "0,15567703", NA, "0,93660763", "0,73416759", 
NA, "0,84525106", "0,76221263", NA, "0,74075742", "0,09400641", 
"0,72202944", "0,98205262"), gluco = c("0,20201604", NA, "0,32943499", 
"0,42898807", NA, "0,38604586", "0,06216771", NA, "0,50682091", 
"0,77352235", NA, NA, NA, NA, NA, "0,54360374", "0,62983336", 
"0,29108942", NA, "0,68860814", "0,49901387", "0,85213969"), 
    dias = c("0,17955655", "0,38582622", NA, "0,93897514", "0,99899076", 
    NA, "0,07829624", "0,61645303", "0,96770222", "0,660479", 
    "0,4553325", "0,22181609", NA, NA, NA, NA, "0,84592719", 
    "0,3863479", "0,38117415", NA, "0,19967415", "0,34450998"
    ), bp = c("0,14100962", "0,12918231", "0,43667462", NA, "0,44168223", 
    NA, "0,69755251", NA, "0,97403356", "0,11554399", NA, "0,08527658", 
    NA, "0,78501791", NA, "0,40595426", "0,10640734", NA, "0,58849266", 
    "0,88895224", NA, "0,98962306"), date = structure(c(18108, 
    16435, 17865, 17835, 17743, 15825, 16831, 16860, 17682, 18016, 
    17256, 17439, 17621, 17865, 17955, 16313, 15948, 17562, 17865, 
    16313, 17774, 16039), class = "Date")), row.names = c(NA, 
-22L), class = "data.frame")