I have a very large dataframe(df) with date,id, x columns.
The date column is formatted as date (y.m.d) and has value for some days scatteredly.Each group in the id column has a different start and end date.
tl;dr my goal is to fill the missing date for each group by adding a row with an empty value for x column. df
date | id | X |
---|---|---|
22.02.02 | A | 26 |
22.02.06 | A | 45 |
22.02.08 | A | 46 |
22.02.05 | B | 37 |
22.02.09 | B | 82 |
22.02.12 | B | 56 |
22.01.13 | C | 61 |
22.01.17 | C | 90 |
22.01.20 | C | 43 |
22.01.23 | C | 12 |
my desired output df:
date | id | X |
---|---|---|
22.02.02 | A | 26 |
22.02.03 | A | |
22.02.04 | A | |
22.02.05 | A | |
22.02.06 | A | 45 |
22.02.07 | A | |
22.02.08 | A | 46 |
22.02.05 | B | 37 |
22.02.06 | B | |
22.02.07 | B | |
22.02.08 | B | |
22.02.09 | B | 82 |
22.02.10 | B | |
22.02.11 | B | |
22.02.12 | B | 56 |
22.01.13 | C | 61 |
22.01.14 | C | |
22.01.15 | C | |
22.01.16 | C | |
22.01.17 | C | 90 |
22.01.18 | C | |
22.01.19 | C | |
22.01.20 | C | 43 |
22.01.21 | C | |
22.01.22 | C | |
22.01.23 | C | 12 |
I do not know how to solve this kind of questions, what is the most efficient way to solve this?!
Here my code which does not give me the result:
df_output <- df %>%
group_by(id) %>%
complete(df$date = seq(min(df$date), max(df$date), by =
"1 day"), fill = list(number = 0))
I am new to R and using dplyr.
CodePudding user response:
- We can try this
library(tidyverse)
library(lubridate)
df$date <- lubridate::as_date(df$date)
df |> group_by(id) |>
complete(date = seq(min(date), max(date), by = "1 day"), fill = list(number = 0))
CodePudding user response:
First we have to transform to date format to use complete
:
library(lubridate)
library(dplyr)
library(tidyr)
df %>%
as_tibble() %>%
mutate(date = ymd(date)) %>%
group_by(id) %>%
complete(date = seq(min(date), max(date), by =
"1 day"), fill = list(number = 0))
id date X
<chr> <date> <int>
1 A 2022-02-02 26
2 A 2022-02-03 NA
3 A 2022-02-04 NA
4 A 2022-02-05 NA
5 A 2022-02-06 45
6 A 2022-02-07 NA
7 A 2022-02-08 46
8 B 2022-02-05 37
9 B 2022-02-06 NA
10 B 2022-02-07 NA
# ... with 16 more rows
CodePudding user response:
Another option using the package padr
with pad
and fill_by_value
like this:
library(dplyr)
library(padr)
df %>%
mutate(date = as.Date(date, format = "%y.%m.%d")) %>%
group_by(id) %>%
pad(., interval = "day") %>%
fill_by_value(., value = NA)
#> # A tibble: 26 × 3
#> # Groups: id [3]
#> date id X
#> <date> <chr> <int>
#> 1 2022-02-02 A 26
#> 2 2022-02-03 A NA
#> 3 2022-02-04 A NA
#> 4 2022-02-05 A NA
#> 5 2022-02-06 A 45
#> 6 2022-02-07 A NA
#> 7 2022-02-08 A 46
#> 8 2022-02-05 B 37
#> 9 2022-02-06 B NA
#> 10 2022-02-07 B NA
#> # … with 16 more rows
#> # ℹ Use `print(n = ...)` to see more rows
Created on 2022-08-14 by the reprex package (v2.0.1)