Home > Software engineering >  What function can I use to complete and fill missing time series observations, avoiding completion b
What function can I use to complete and fill missing time series observations, avoiding completion b

Time:04-01

I have long time series dataframe grouped by id. The series have different start dates and also missing observations. I want to complete missing observations, by completing the the date and id and filling it with 0.

What I want to avoid in the process, is to complete the missing observations in the beginning, because this is just an indicator, that the time series has a later starting point (different launch date of product for example).

In my reprex I used complete from tidyr. It does the opposite of what I want. Instead of completing the id "A1" with "2015-01-04", it completes the id "B1" with "2015-01-01", which is not needed in this case. Does complete always create groups of the same size? Maybe then it is the wrong function.

How can I achieve the opposite in the following example?

library(tidyr)

data <- data.frame (id = as.character(c(rep("A1",6),rep("B1",5))),
                    value = c(seq( 1, 9, length.out = 11)),
                    date = as.Date(c(c("2015-01-01","2015-01-02","2015-01-03",
                                         "2015-01-05","2015-01-06","2015-01-07"),
                                       c("2015-01-02","2015-01-03","2015-01-05",
                                         "2015-01-06","2015-01-07")
                                      )
                    )
)

data %>% complete(date, id, fill = list(value = 0)) 

CodePudding user response:

You need to provide the dates to fill explicitly:

data %>%
  group_by(id) %>%
  complete(date = seq(min(date), max(date), by = 1), fill = list(value = 0))

CodePudding user response:

Doing it rectangularly is easiest to express. You can reintroduce the missingness as follows:

data %>% 
  tidyr::complete(date, id, fill = list(value = 0)) %>%
  dplyr::group_by(id) %>%
  dplyr::arrange(date) %>%   # Ensure it's sorted by date
  dplyr::filter(!cumall(value == 0)) %>%  # Don't keep zeros that didn't have non-0 rows before
  dplyr::ungroup()

CodePudding user response:

library(tidyverse)

data <- data.frame(
  id = as.character(c(rep("A1", 6), rep("B1", 5))),
  value = c(seq(1, 9, length.out = 11)),
  date = as.Date(c(
    c(
      "2015-01-01", "2015-01-02", "2015-01-03",
      "2015-01-05", "2015-01-06", "2015-01-07"
    ),
    c(
      "2015-01-02", "2015-01-03", "2015-01-05",
      "2015-01-06", "2015-01-07"
    )
  ))
)


all_dates <- seq(min(data$date), max(data$date), by = "day") %>% as.character()

# complete all dates for each id
data %>%
  as_tibble() %>%
  group_by(id) %>%
  mutate(date = date %>% as.character() %>% factor(levels = all_dates)) %>%
  complete(date, fill = list(value = 0)) %>%
  mutate(date = date %>% as.Date())
#> # A tibble: 14 × 3
#> # Groups:   id [2]
#>    id    date       value
#>    <chr> <date>     <dbl>
#>  1 A1    2015-01-01   1  
#>  2 A1    2015-01-02   1.8
#>  3 A1    2015-01-03   2.6
#>  4 A1    2015-01-04   0  
#>  5 A1    2015-01-05   3.4
#>  6 A1    2015-01-06   4.2
#>  7 A1    2015-01-07   5  
#>  8 B1    2015-01-01   0  
#>  9 B1    2015-01-02   5.8
#> 10 B1    2015-01-03   6.6
#> 11 B1    2015-01-04   0  
#> 12 B1    2015-01-05   7.4
#> 13 B1    2015-01-06   8.2
#> 14 B1    2015-01-07   9

Created on 2022-04-01 by the reprex package (v2.0.0)

CodePudding user response:

This is not very elegant, but it works.

data.frame(date = rep(dates, length(id)),
           id = rep(ids, each = length(dates))) |> 
        full_join(data) |>
        arrange(id, date) |>
        group_by(id) |>
        filter(!is.na(value) | row_number() > 1) |> 
        mutate(value = replace_na(value, 0)) |> 
        ungroup()
  • Related