Fill dates by groups in a data frame R-CodePudding

I wanna go from this:

name	code	date	usage	result
Jennifer Aniston	23211	2021-11-04	345	1
Jennifer Aniston	23211	2021-11-05	260	1
Jennifer Aniston	23211	2021-11-06	230	0
Jennifer Aniston	23211	2021-11-07	0	0
Matthew Perry	44215	2022-10-01	312	1
Matthew Perry	44215	2022-10-04	230	0
Matthew Perry	44215	2021-10-05	232	0
Lisa Kudrow	55120	2022-01-01	132	0
Lisa Kudrow	55120	2022-01-02	125	0
Lisa Kudrow	55120	2022-01-04	345	1
Lisa Kudrow	55120	2022-01-06	321	1
Lisa Kudrow	55120	2022-01-07	431	1

(note: for Jennifer Aniston we have all the dates from the minimum date that appears and the maximum date, but for Matthew Perry we have missing dates, it starts at 2022-10-01 but we don't have the 2 and the 3 of october. The same happens with Lisa Kudrow, it starts at 1 january but we miss 3 and 5 of january)

to this:

name	code	date	usage	result
Jennifer Aniston	23211	2021-11-04	345	1
Jennifer Aniston	23211	2021-11-05	260	1
Jennifer Aniston	23211	2021-11-06	230	0
Jennifer Aniston	23211	2021-11-07	0	0
Matthew Perry	44215	2022-10-01	312	1
Matthew Perry	44215	2022-10-02	NA	NA
Matthew Perry	44215	2022-10-03	NA	NA
Matthew Perry	44215	2022-10-04	230	0
Matthew Perry	44215	2021-10-05	232	0
Lisa Kudrow	55120	2022-01-01	132	0
Lisa Kudrow	55120	2022-01-02	125	0
Lisa Kudrow	55120	2022-01-03	NA	NA
Lisa Kudrow	55120	2022-01-04	345	1
Lisa Kudrow	55120	2022-01-05	NA	NA
Lisa Kudrow	55120	2022-01-06	321	1
Lisa Kudrow	55120	2022-01-07	431	1

So now we have all the dates, filling with an NA where we don't have data available and with the name and the code for the person.

Any idea of howing this in R? (preferibly using dplyr and pipes)

CodePudding user response：

Assuming that your dates are actually in date format rather than character format (we can't tell from the table in the question), and assuming that the 7th row has the wrong year in it (2021 as opposed to 2022), you can do:

library(tidyverse)

df %>% 
  split(.$name) %>%
  lapply(function(x) {
    complete(x, expand(x, date = seq(min(x$date), max(x$date), by = 'day')),
             fill = list(name = x$name[1], code = x$code[1]))}) %>%
  bind_rows()
#> # A tibble: 16 x 5
#>    date       name              code usage result
#>    <date>     <chr>            <int> <int>  <int>
#>  1 2021-11-04 Jennifer Aniston 23211   345      1
#>  2 2021-11-05 Jennifer Aniston 23211   260      1
#>  3 2021-11-06 Jennifer Aniston 23211   230      0
#>  4 2021-11-07 Jennifer Aniston 23211     0      0
#>  5 2022-01-01 Lisa Kudrow      55120   132      0
#>  6 2022-01-02 Lisa Kudrow      55120   125      0
#>  7 2022-01-03 Lisa Kudrow      55120    NA     NA
#>  8 2022-01-04 Lisa Kudrow      55120   345      1
#>  9 2022-01-05 Lisa Kudrow      55120    NA     NA
#> 10 2022-01-06 Lisa Kudrow      55120   321      1
#> 11 2022-01-07 Lisa Kudrow      55120   431      1
#> 12 2022-10-01 Matthew Perry    44215   312      1
#> 13 2022-10-02 Matthew Perry    44215    NA     NA
#> 14 2022-10-03 Matthew Perry    44215    NA     NA
#> 15 2022-10-04 Matthew Perry    44215   230      0
#> 16 2022-10-05 Matthew Perry    44215   232      0

^{Created on 2022-06-02 by the reprex package (v2.0.1)}

Data taken from question in reproducible format

df <- structure(list(name = c("Jennifer Aniston", "Jennifer Aniston", 
"Jennifer Aniston", "Jennifer Aniston", "Matthew Perry", "Matthew Perry", 
"Matthew Perry", "Lisa Kudrow", "Lisa Kudrow", "Lisa Kudrow", 
"Lisa Kudrow", "Lisa Kudrow"), code = c(23211L, 23211L, 23211L, 
23211L, 44215L, 44215L, 44215L, 55120L, 55120L, 55120L, 55120L, 
55120L), date = structure(c(18935, 18936, 18937, 18938, 19266, 
19269, 19270, 18993, 18994, 18996, 18998, 18999), class = "Date"), 
    usage = c(345L, 260L, 230L, 0L, 312L, 230L, 232L, 132L, 125L, 
    345L, 321L, 431L), result = c(1L, 1L, 0L, 0L, 1L, 0L, 0L, 
    0L, 0L, 1L, 1L, 1L)), row.names = c(NA, -12L), class = "data.frame")

CodePudding user response：

Allan's answer is very nice. However, as I faced difficulties with expand() and complete() in the past, I made a custom function that I often use now. Maybe you will find it useful (and if you improve it please let me know ;-) ).

See the example below, using df from Allan's post:

autocomplete <- function(.data, .nestingvars, .indexvar) {
  # initialize the index that will expand the groups formed in .data by .nestingvars
  index <- data.frame(index = seq.Date(from = min(.data[[.indexvar]]),
                                       to = max(.data[[.indexvar]]),
                                       by = "day"))
  names(index)[1] <- .indexvar
  # peform a cross join to get all possible combinations of .nestingvars and .indexvars
  index <- base::merge(index, unique(.data[.nestingvars]), by = NULL)
  # merge index and data
  out <- base::merge(.data, index, by = c(.nestingvars, .indexvar), all.y = TRUE)
  return(out)
}

library(dplyr)
split(df, ~ name, drop = TRUE) %>% 
  purrr::map(.x = ., 
             .f = ~ autocomplete(.data = ., 
                                 .nestingvars = c("name", "code"), 
                                 .indexvar = "date")) %>% 
  bind_rows()
#>                name  code       date usage result
#> 1  Jennifer Aniston 23211 2021-11-04   345      1
#> 2  Jennifer Aniston 23211 2021-11-05   260      1
#> 3  Jennifer Aniston 23211 2021-11-06   230      0
#> 4  Jennifer Aniston 23211 2021-11-07     0      0
#> 5       Lisa Kudrow 55120 2022-01-01   132      0
#> 6       Lisa Kudrow 55120 2022-01-02   125      0
#> 7       Lisa Kudrow 55120 2022-01-03    NA     NA
#> 8       Lisa Kudrow 55120 2022-01-04   345      1
#> 9       Lisa Kudrow 55120 2022-01-05    NA     NA
#> 10      Lisa Kudrow 55120 2022-01-06   321      1
#> 11      Lisa Kudrow 55120 2022-01-07   431      1
#> 12    Matthew Perry 44215 2022-10-01   312      1
#> 13    Matthew Perry 44215 2022-10-02    NA     NA
#> 14    Matthew Perry 44215 2022-10-03    NA     NA
#> 15    Matthew Perry 44215 2022-10-04   230      0
#> 16    Matthew Perry 44215 2022-10-05   232      0

^{Created on 2022-06-02 by the reprex package (v2.0.1)}