filling missing date for each group in dataframe-CodePudding

I have a very large dataframe(df) with date,id, x columns.

The date column is formatted as date (y.m.d) and has value for some days scatteredly.Each group in the id column has a different start and end date.

tl;dr my goal is to fill the missing date for each group by adding a row with an empty value for x column. df

date	id	X
22.02.02	A	26
22.02.06	A	45
22.02.08	A	46
22.02.05	B	37
22.02.09	B	82
22.02.12	B	56
22.01.13	C	61
22.01.17	C	90
22.01.20	C	43
22.01.23	C	12

my desired output df:

date	id	X
22.02.02	A	26
22.02.03	A
22.02.04	A
22.02.05	A
22.02.06	A	45
22.02.07	A
22.02.08	A	46
22.02.05	B	37
22.02.06	B
22.02.07	B
22.02.08	B
22.02.09	B	82
22.02.10	B
22.02.11	B
22.02.12	B	56
22.01.13	C	61
22.01.14	C
22.01.15	C
22.01.16	C
22.01.17	C	90
22.01.18	C
22.01.19	C
22.01.20	C	43
22.01.21	C
22.01.22	C
22.01.23	C	12

I do not know how to solve this kind of questions, what is the most efficient way to solve this?!

Here my code which does not give me the result:

     df_output <- df %>%
       group_by(id) %>%
        complete(df$date = seq(min(df$date), max(df$date), by = 
        "1 day"), fill = list(number = 0))

I am new to R and using dplyr.

CodePudding user response：

We can try this

library(tidyverse)
library(lubridate)

df$date <- lubridate::as_date(df$date)

df |> group_by(id) |>
      complete(date = seq(min(date), max(date), by = "1 day"), fill = list(number = 0))

CodePudding user response：

First we have to transform to date format to use complete:

library(lubridate)
library(dplyr)
library(tidyr)

df %>%
  as_tibble() %>% 
  mutate(date = ymd(date)) %>% 
  group_by(id) %>%
  complete(date = seq(min(date), max(date), by = 
                           "1 day"), fill = list(number = 0))

   id    date           X
   <chr> <date>     <int>
 1 A     2022-02-02    26
 2 A     2022-02-03    NA
 3 A     2022-02-04    NA
 4 A     2022-02-05    NA
 5 A     2022-02-06    45
 6 A     2022-02-07    NA
 7 A     2022-02-08    46
 8 B     2022-02-05    37
 9 B     2022-02-06    NA
10 B     2022-02-07    NA
# ... with 16 more rows

CodePudding user response：

Another option using the package padr with pad and fill_by_value like this:

library(dplyr)
library(padr)
df %>%
  mutate(date = as.Date(date, format = "%y.%m.%d")) %>%
  group_by(id) %>%
  pad(., interval = "day") %>%
  fill_by_value(., value = NA)
#> # A tibble: 26 × 3
#> # Groups:   id [3]
#>    date       id        X
#>    <date>     <chr> <int>
#>  1 2022-02-02 A        26
#>  2 2022-02-03 A        NA
#>  3 2022-02-04 A        NA
#>  4 2022-02-05 A        NA
#>  5 2022-02-06 A        45
#>  6 2022-02-07 A        NA
#>  7 2022-02-08 A        46
#>  8 2022-02-05 B        37
#>  9 2022-02-06 B        NA
#> 10 2022-02-07 B        NA
#> # … with 16 more rows
#> # ℹ Use `print(n = ...)` to see more rows

^{Created on 2022-08-14 by the reprex package (v2.0.1)}