I wanna go from this:
name | code | date | usage | result |
---|---|---|---|---|
Jennifer Aniston | 23211 | 2021-11-04 | 345 | 1 |
Jennifer Aniston | 23211 | 2021-11-05 | 260 | 1 |
Jennifer Aniston | 23211 | 2021-11-06 | 230 | 0 |
Jennifer Aniston | 23211 | 2021-11-07 | 0 | 0 |
Matthew Perry | 44215 | 2022-10-01 | 312 | 1 |
Matthew Perry | 44215 | 2022-10-04 | 230 | 0 |
Matthew Perry | 44215 | 2021-10-05 | 232 | 0 |
Lisa Kudrow | 55120 | 2022-01-01 | 132 | 0 |
Lisa Kudrow | 55120 | 2022-01-02 | 125 | 0 |
Lisa Kudrow | 55120 | 2022-01-04 | 345 | 1 |
Lisa Kudrow | 55120 | 2022-01-06 | 321 | 1 |
Lisa Kudrow | 55120 | 2022-01-07 | 431 | 1 |
(note: for Jennifer Aniston we have all the dates from the minimum date that appears and the maximum date, but for Matthew Perry we have missing dates, it starts at 2022-10-01 but we don't have the 2 and the 3 of october. The same happens with Lisa Kudrow, it starts at 1 january but we miss 3 and 5 of january)
to this:
name | code | date | usage | result |
---|---|---|---|---|
Jennifer Aniston | 23211 | 2021-11-04 | 345 | 1 |
Jennifer Aniston | 23211 | 2021-11-05 | 260 | 1 |
Jennifer Aniston | 23211 | 2021-11-06 | 230 | 0 |
Jennifer Aniston | 23211 | 2021-11-07 | 0 | 0 |
Matthew Perry | 44215 | 2022-10-01 | 312 | 1 |
Matthew Perry | 44215 | 2022-10-02 | NA | NA |
Matthew Perry | 44215 | 2022-10-03 | NA | NA |
Matthew Perry | 44215 | 2022-10-04 | 230 | 0 |
Matthew Perry | 44215 | 2021-10-05 | 232 | 0 |
Lisa Kudrow | 55120 | 2022-01-01 | 132 | 0 |
Lisa Kudrow | 55120 | 2022-01-02 | 125 | 0 |
Lisa Kudrow | 55120 | 2022-01-03 | NA | NA |
Lisa Kudrow | 55120 | 2022-01-04 | 345 | 1 |
Lisa Kudrow | 55120 | 2022-01-05 | NA | NA |
Lisa Kudrow | 55120 | 2022-01-06 | 321 | 1 |
Lisa Kudrow | 55120 | 2022-01-07 | 431 | 1 |
So now we have all the dates, filling with an NA where we don't have data available and with the name and the code for the person.
Any idea of howing this in R? (preferibly using dplyr and pipes)
CodePudding user response:
Assuming that your dates are actually in date format rather than character format (we can't tell from the table in the question), and assuming that the 7th row has the wrong year in it (2021 as opposed to 2022), you can do:
library(tidyverse)
df %>%
split(.$name) %>%
lapply(function(x) {
complete(x, expand(x, date = seq(min(x$date), max(x$date), by = 'day')),
fill = list(name = x$name[1], code = x$code[1]))}) %>%
bind_rows()
#> # A tibble: 16 x 5
#> date name code usage result
#> <date> <chr> <int> <int> <int>
#> 1 2021-11-04 Jennifer Aniston 23211 345 1
#> 2 2021-11-05 Jennifer Aniston 23211 260 1
#> 3 2021-11-06 Jennifer Aniston 23211 230 0
#> 4 2021-11-07 Jennifer Aniston 23211 0 0
#> 5 2022-01-01 Lisa Kudrow 55120 132 0
#> 6 2022-01-02 Lisa Kudrow 55120 125 0
#> 7 2022-01-03 Lisa Kudrow 55120 NA NA
#> 8 2022-01-04 Lisa Kudrow 55120 345 1
#> 9 2022-01-05 Lisa Kudrow 55120 NA NA
#> 10 2022-01-06 Lisa Kudrow 55120 321 1
#> 11 2022-01-07 Lisa Kudrow 55120 431 1
#> 12 2022-10-01 Matthew Perry 44215 312 1
#> 13 2022-10-02 Matthew Perry 44215 NA NA
#> 14 2022-10-03 Matthew Perry 44215 NA NA
#> 15 2022-10-04 Matthew Perry 44215 230 0
#> 16 2022-10-05 Matthew Perry 44215 232 0
Created on 2022-06-02 by the reprex package (v2.0.1)
Data taken from question in reproducible format
df <- structure(list(name = c("Jennifer Aniston", "Jennifer Aniston",
"Jennifer Aniston", "Jennifer Aniston", "Matthew Perry", "Matthew Perry",
"Matthew Perry", "Lisa Kudrow", "Lisa Kudrow", "Lisa Kudrow",
"Lisa Kudrow", "Lisa Kudrow"), code = c(23211L, 23211L, 23211L,
23211L, 44215L, 44215L, 44215L, 55120L, 55120L, 55120L, 55120L,
55120L), date = structure(c(18935, 18936, 18937, 18938, 19266,
19269, 19270, 18993, 18994, 18996, 18998, 18999), class = "Date"),
usage = c(345L, 260L, 230L, 0L, 312L, 230L, 232L, 132L, 125L,
345L, 321L, 431L), result = c(1L, 1L, 0L, 0L, 1L, 0L, 0L,
0L, 0L, 1L, 1L, 1L)), row.names = c(NA, -12L), class = "data.frame")
CodePudding user response:
Allan's answer is very nice. However, as I faced difficulties with expand()
and complete()
in the past, I made a custom function that I often use now. Maybe you will find it useful (and if you improve it please let me know ;-) ).
See the example below, using df
from Allan's post:
autocomplete <- function(.data, .nestingvars, .indexvar) {
# initialize the index that will expand the groups formed in .data by .nestingvars
index <- data.frame(index = seq.Date(from = min(.data[[.indexvar]]),
to = max(.data[[.indexvar]]),
by = "day"))
names(index)[1] <- .indexvar
# peform a cross join to get all possible combinations of .nestingvars and .indexvars
index <- base::merge(index, unique(.data[.nestingvars]), by = NULL)
# merge index and data
out <- base::merge(.data, index, by = c(.nestingvars, .indexvar), all.y = TRUE)
return(out)
}
library(dplyr)
split(df, ~ name, drop = TRUE) %>%
purrr::map(.x = .,
.f = ~ autocomplete(.data = .,
.nestingvars = c("name", "code"),
.indexvar = "date")) %>%
bind_rows()
#> name code date usage result
#> 1 Jennifer Aniston 23211 2021-11-04 345 1
#> 2 Jennifer Aniston 23211 2021-11-05 260 1
#> 3 Jennifer Aniston 23211 2021-11-06 230 0
#> 4 Jennifer Aniston 23211 2021-11-07 0 0
#> 5 Lisa Kudrow 55120 2022-01-01 132 0
#> 6 Lisa Kudrow 55120 2022-01-02 125 0
#> 7 Lisa Kudrow 55120 2022-01-03 NA NA
#> 8 Lisa Kudrow 55120 2022-01-04 345 1
#> 9 Lisa Kudrow 55120 2022-01-05 NA NA
#> 10 Lisa Kudrow 55120 2022-01-06 321 1
#> 11 Lisa Kudrow 55120 2022-01-07 431 1
#> 12 Matthew Perry 44215 2022-10-01 312 1
#> 13 Matthew Perry 44215 2022-10-02 NA NA
#> 14 Matthew Perry 44215 2022-10-03 NA NA
#> 15 Matthew Perry 44215 2022-10-04 230 0
#> 16 Matthew Perry 44215 2022-10-05 232 0
Created on 2022-06-02 by the reprex package (v2.0.1)