I have long time series dataframe grouped by id. The series have different start dates and also missing observations. I want to complete missing observations, by completing the the date and id and filling it with 0.
What I want to avoid in the process, is to complete the missing observations in the beginning, because this is just an indicator, that the time series has a later starting point (different launch date of product for example).
In my reprex I used complete
from tidyr
. It does the opposite of what I want. Instead of completing the id "A1" with "2015-01-04", it completes the id "B1" with "2015-01-01", which is not needed in this case. Does complete always create groups of the same size? Maybe then it is the wrong function.
How can I achieve the opposite in the following example?
library(tidyr)
data <- data.frame (id = as.character(c(rep("A1",6),rep("B1",5))),
value = c(seq( 1, 9, length.out = 11)),
date = as.Date(c(c("2015-01-01","2015-01-02","2015-01-03",
"2015-01-05","2015-01-06","2015-01-07"),
c("2015-01-02","2015-01-03","2015-01-05",
"2015-01-06","2015-01-07")
)
)
)
data %>% complete(date, id, fill = list(value = 0))
CodePudding user response:
You need to provide the dates to fill explicitly:
data %>%
group_by(id) %>%
complete(date = seq(min(date), max(date), by = 1), fill = list(value = 0))
CodePudding user response:
Doing it rectangularly is easiest to express. You can reintroduce the missingness as follows:
data %>%
tidyr::complete(date, id, fill = list(value = 0)) %>%
dplyr::group_by(id) %>%
dplyr::arrange(date) %>% # Ensure it's sorted by date
dplyr::filter(!cumall(value == 0)) %>% # Don't keep zeros that didn't have non-0 rows before
dplyr::ungroup()
CodePudding user response:
library(tidyverse)
data <- data.frame(
id = as.character(c(rep("A1", 6), rep("B1", 5))),
value = c(seq(1, 9, length.out = 11)),
date = as.Date(c(
c(
"2015-01-01", "2015-01-02", "2015-01-03",
"2015-01-05", "2015-01-06", "2015-01-07"
),
c(
"2015-01-02", "2015-01-03", "2015-01-05",
"2015-01-06", "2015-01-07"
)
))
)
all_dates <- seq(min(data$date), max(data$date), by = "day") %>% as.character()
# complete all dates for each id
data %>%
as_tibble() %>%
group_by(id) %>%
mutate(date = date %>% as.character() %>% factor(levels = all_dates)) %>%
complete(date, fill = list(value = 0)) %>%
mutate(date = date %>% as.Date())
#> # A tibble: 14 × 3
#> # Groups: id [2]
#> id date value
#> <chr> <date> <dbl>
#> 1 A1 2015-01-01 1
#> 2 A1 2015-01-02 1.8
#> 3 A1 2015-01-03 2.6
#> 4 A1 2015-01-04 0
#> 5 A1 2015-01-05 3.4
#> 6 A1 2015-01-06 4.2
#> 7 A1 2015-01-07 5
#> 8 B1 2015-01-01 0
#> 9 B1 2015-01-02 5.8
#> 10 B1 2015-01-03 6.6
#> 11 B1 2015-01-04 0
#> 12 B1 2015-01-05 7.4
#> 13 B1 2015-01-06 8.2
#> 14 B1 2015-01-07 9
Created on 2022-04-01 by the reprex package (v2.0.0)
CodePudding user response:
This is not very elegant, but it works.
data.frame(date = rep(dates, length(id)),
id = rep(ids, each = length(dates))) |>
full_join(data) |>
arrange(id, date) |>
group_by(id) |>
filter(!is.na(value) | row_number() > 1) |>
mutate(value = replace_na(value, 0)) |>
ungroup()