Home > Net >  filling missing date for each group in dataframe
filling missing date for each group in dataframe

Time:08-14

I have a very large dataframe(df) with date,id, x columns.

The date column is formatted as date (y.m.d) and has value for some days scatteredly.Each group in the id column has a different start and end date.

tl;dr my goal is to fill the missing date for each group by adding a row with an empty value for x column. df

date id X
22.02.02 A 26
22.02.06 A 45
22.02.08 A 46
22.02.05 B 37
22.02.09 B 82
22.02.12 B 56
22.01.13 C 61
22.01.17 C 90
22.01.20 C 43
22.01.23 C 12

my desired output df:

date id X
22.02.02 A 26
22.02.03 A
22.02.04 A
22.02.05 A
22.02.06 A 45
22.02.07 A
22.02.08 A 46
22.02.05 B 37
22.02.06 B
22.02.07 B
22.02.08 B
22.02.09 B 82
22.02.10 B
22.02.11 B
22.02.12 B 56
22.01.13 C 61
22.01.14 C
22.01.15 C
22.01.16 C
22.01.17 C 90
22.01.18 C
22.01.19 C
22.01.20 C 43
22.01.21 C
22.01.22 C
22.01.23 C 12

I do not know how to solve this kind of questions, what is the most efficient way to solve this?!

Here my code which does not give me the result:

     df_output <- df %>%
       group_by(id) %>%
        complete(df$date = seq(min(df$date), max(df$date), by = 
        "1 day"), fill = list(number = 0))

I am new to R and using dplyr.

CodePudding user response:

  • We can try this
library(tidyverse)
library(lubridate)

df$date <- lubridate::as_date(df$date)

df |> group_by(id) |>
      complete(date = seq(min(date), max(date), by = "1 day"), fill = list(number = 0))

CodePudding user response:

First we have to transform to date format to use complete:

library(lubridate)
library(dplyr)
library(tidyr)

df %>%
  as_tibble() %>% 
  mutate(date = ymd(date)) %>% 
  group_by(id) %>%
  complete(date = seq(min(date), max(date), by = 
                           "1 day"), fill = list(number = 0))
   id    date           X
   <chr> <date>     <int>
 1 A     2022-02-02    26
 2 A     2022-02-03    NA
 3 A     2022-02-04    NA
 4 A     2022-02-05    NA
 5 A     2022-02-06    45
 6 A     2022-02-07    NA
 7 A     2022-02-08    46
 8 B     2022-02-05    37
 9 B     2022-02-06    NA
10 B     2022-02-07    NA
# ... with 16 more rows

CodePudding user response:

Another option using the package padr with pad and fill_by_value like this:

library(dplyr)
library(padr)
df %>%
  mutate(date = as.Date(date, format = "%y.%m.%d")) %>%
  group_by(id) %>%
  pad(., interval = "day") %>%
  fill_by_value(., value = NA)
#> # A tibble: 26 × 3
#> # Groups:   id [3]
#>    date       id        X
#>    <date>     <chr> <int>
#>  1 2022-02-02 A        26
#>  2 2022-02-03 A        NA
#>  3 2022-02-04 A        NA
#>  4 2022-02-05 A        NA
#>  5 2022-02-06 A        45
#>  6 2022-02-07 A        NA
#>  7 2022-02-08 A        46
#>  8 2022-02-05 B        37
#>  9 2022-02-06 B        NA
#> 10 2022-02-07 B        NA
#> # … with 16 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Created on 2022-08-14 by the reprex package (v2.0.1)

  • Related