Home > Blockchain >  How do you populate missing dates for lag?
How do you populate missing dates for lag?

Time:11-04

Say that I have a dataset.

date <- c("2004-02-01", "2004-03-05", "2004-08-09", "2004-08-13", "2004-10-20", "2004-11-02", "2008-01-05", "2008-02-03", "2008-08-09", "2008-11-04", "2012-01-05", "2012-02-03", "2012-08-09", "2012-10-04", "2012-10-04", "2012-10-31", "2012-11-04")

date <- ymd(date)

name <- c("Joe", "Joe", "Joe", "Joe", "Joe", "Joe",
          "Larry", "Larry", "Larry", "Larry",
          "Jeff", "Jeff", "Jeff", "Jeff", "Jeff", "Jeff", "Jeff")

hits <- c(5, 4, 10, 9, 15, 1,
          13, 22, 9, 11,
          15, 17, 10, 3, 4, 2, 33)

df <- data.frame(date, name, hits)

I want to do 7-day time lags for each observation. In order to do this, I will have to restructure the dataset a bit.

I want to add seven days after each date for each name, but the hits will be 0. I hope to end up with a dataset like the following (for Joe):

date  name hits
2004-02-01   Joe    5
2004-02-02   Joe    0
2004-02-03   Joe    0
2004-02-04   Joe    0
2004-02-05   Joe    0
2004-02-06   Joe    0
2004-02-07   Joe    0
2004-02-08   Joe    0
2004-03-05   Joe    4
2004-03-06   Joe    0
2004-03-07   Joe    0
2004-03-08   Joe    0
2004-03-09   Joe    0
2004-03-10   Joe    0
2004-03-11   Joe    0
2004-03-12   Joe    0
2004-08-09   Joe   10
2004-08-10   Joe   0
2004-08-11   Joe   0
2004-08-12   Joe   0
2004-08-13   Joe   9
2004-08-14   Joe   0
2004-08-15   Joe   0
2004-08-16   Joe   0
2004-08-17   Joe   0
2004-08-18   Joe   0
2004-08-19   Joe   0
2004-08-20   Joe   0
2004-10-20   Joe    15
2004-10-21   Joe    0
2004-10-22   Joe    0
2004-10-23   Joe    0
2004-10-24   Joe    0
2004-10-25   Joe    0
2004-10-26   Joe    0
2004-10-27   Joe    0
2004-11-02   Joe    1
2004-11-03   Joe    0
2004-11-04   Joe    0
2004-11-05   Joe    0
2004-11-06   Joe    0
2004-11-07   Joe    0
2004-11-08   Joe    0
2004-11-09   Joe    0

Is there a fast way to do this using dplyr?

CodePudding user response:

The dplyr::summarise function can be used to add rows for the next 7 days for each combination of name and date:

library(tidyverse)

ndays=7

df.filled = df %>% 
  mutate(date = as.Date(date)) %>% 
  arrange(name, date) %>% 
  group_by(name, date, hits) %>%
  summarise(date = date   0:ndays, 
            hits = c(hits, rep(0, ndays))) %>% 
  ungroup()

df.filled %>% filter(name=="Joe") %>% print(n=Inf)
#> # A tibble: 48 × 3
#>    name  date        hits
#>    <chr> <date>     <dbl>
#>  1 Joe   2004-02-01     5
#>  2 Joe   2004-02-02     0
#>  3 Joe   2004-02-03     0
#>  4 Joe   2004-02-04     0
#>  5 Joe   2004-02-05     0
#>  6 Joe   2004-02-06     0
#>  7 Joe   2004-02-07     0
#>  8 Joe   2004-02-08     0
#>  9 Joe   2004-03-05     4
#> 10 Joe   2004-03-06     0
#> 11 Joe   2004-03-07     0
#> 12 Joe   2004-03-08     0
#> 13 Joe   2004-03-09     0
#> 14 Joe   2004-03-10     0
#> 15 Joe   2004-03-11     0
#> 16 Joe   2004-03-12     0
#> 17 Joe   2004-08-09    10
#> 18 Joe   2004-08-10     0
#> 19 Joe   2004-08-11     0
#> 20 Joe   2004-08-12     0
#> 21 Joe   2004-08-13     0
#> 22 Joe   2004-08-14     0
#> 23 Joe   2004-08-15     0
#> 24 Joe   2004-08-16     0
#> 25 Joe   2004-08-13     9
#> 26 Joe   2004-08-14     0
#> 27 Joe   2004-08-15     0
#> 28 Joe   2004-08-16     0
#> 29 Joe   2004-08-17     0
#> 30 Joe   2004-08-18     0
#> 31 Joe   2004-08-19     0
#> 32 Joe   2004-08-20     0
#> 33 Joe   2004-10-20    15
#> 34 Joe   2004-10-21     0
#> 35 Joe   2004-10-22     0
#> 36 Joe   2004-10-23     0
#> 37 Joe   2004-10-24     0
#> 38 Joe   2004-10-25     0
#> 39 Joe   2004-10-26     0
#> 40 Joe   2004-10-27     0
#> 41 Joe   2004-11-02     1
#> 42 Joe   2004-11-03     0
#> 43 Joe   2004-11-04     0
#> 44 Joe   2004-11-05     0
#> 45 Joe   2004-11-06     0
#> 46 Joe   2004-11-07     0
#> 47 Joe   2004-11-08     0
#> 48 Joe   2004-11-09     0

Note, however, that with the code above you could end up with repeated dates if a given name has two dates that are less than 7 days apart. Thus, it's probably safer to do the following: In the code below, we fill in every date from the first to the last 7 days for each name. Then we join that back to the original data to populate the dates that have non-zero hits.

df$date = as.Date(df$date)

df.filled2 = df %>% 
  group_by(name) %>% 
  summarise(date = seq(min(date), max(date) 7,"1 day")) %>% 
  left_join(df) %>%
  mutate(hits=replace_na(hits, 0))

df.filled2 %>% filter(name=="Joe") %>% print(n=Inf)
#> # A tibble: 283 × 3
#> # Groups:   name [1]
#>     name  date        hits
#>     <chr> <date>     <dbl>
#>   1 Joe   2004-02-01     5
#>   2 Joe   2004-02-02     0
#>   3 Joe   2004-02-03     0
#>   4 Joe   2004-02-04     0
#>   5 Joe   2004-02-05     0
#>   6 Joe   2004-02-06     0
#>   7 Joe   2004-02-07     0
#>   8 Joe   2004-02-08     0
#>   9 Joe   2004-02-09     0
#>  10 Joe   2004-02-10     0
#>  11 Joe   2004-02-11     0
#>  12 Joe   2004-02-12     0
#>  13 Joe   2004-02-13     0
#>  14 Joe   2004-02-14     0
#>  15 Joe   2004-02-15     0
#>  16 Joe   2004-02-16     0
#>  17 Joe   2004-02-17     0
#>  18 Joe   2004-02-18     0
#>  19 Joe   2004-02-19     0
#>  20 Joe   2004-02-20     0
#>  21 Joe   2004-02-21     0
#>  22 Joe   2004-02-22     0
#>  23 Joe   2004-02-23     0
#>  24 Joe   2004-02-24     0
#>  25 Joe   2004-02-25     0
#>  26 Joe   2004-02-26     0
#>  27 Joe   2004-02-27     0
#>  28 Joe   2004-02-28     0
#>  29 Joe   2004-02-29     0
#>  30 Joe   2004-03-01     0
#>  31 Joe   2004-03-02     0
#>  32 Joe   2004-03-03     0
#>  33 Joe   2004-03-04     0
#>  34 Joe   2004-03-05     4
#>  35 Joe   2004-03-06     0
#>  36 Joe   2004-03-07     0
#>  37 Joe   2004-03-08     0
#>  38 Joe   2004-03-09     0
#>  39 Joe   2004-03-10     0
#>  40 Joe   2004-03-11     0
#>  41 Joe   2004-03-12     0
#>  42 Joe   2004-03-13     0
#>  43 Joe   2004-03-14     0
#>  44 Joe   2004-03-15     0
#>  45 Joe   2004-03-16     0
#>  46 Joe   2004-03-17     0
#>  47 Joe   2004-03-18     0
#>  48 Joe   2004-03-19     0
#>  49 Joe   2004-03-20     0
#>  50 Joe   2004-03-21     0
#>  51 Joe   2004-03-22     0
#>  52 Joe   2004-03-23     0
#>  53 Joe   2004-03-24     0
#>  54 Joe   2004-03-25     0
#>  55 Joe   2004-03-26     0
#>  56 Joe   2004-03-27     0
#>  57 Joe   2004-03-28     0
#>  58 Joe   2004-03-29     0
#>  59 Joe   2004-03-30     0
#>  60 Joe   2004-03-31     0
#>  61 Joe   2004-04-01     0
#>  62 Joe   2004-04-02     0
#>  63 Joe   2004-04-03     0
#>  64 Joe   2004-04-04     0
#>  65 Joe   2004-04-05     0
#>  66 Joe   2004-04-06     0
#>  67 Joe   2004-04-07     0
#>  68 Joe   2004-04-08     0
#>  69 Joe   2004-04-09     0
#>  70 Joe   2004-04-10     0
#>  71 Joe   2004-04-11     0
#>  72 Joe   2004-04-12     0
#>  73 Joe   2004-04-13     0
#>  74 Joe   2004-04-14     0
#>  75 Joe   2004-04-15     0
#>  76 Joe   2004-04-16     0
#>  77 Joe   2004-04-17     0
#>  78 Joe   2004-04-18     0
#>  79 Joe   2004-04-19     0
#>  80 Joe   2004-04-20     0
#>  81 Joe   2004-04-21     0
#>  82 Joe   2004-04-22     0
#>  83 Joe   2004-04-23     0
#>  84 Joe   2004-04-24     0
#>  85 Joe   2004-04-25     0
#>  86 Joe   2004-04-26     0
#>  87 Joe   2004-04-27     0
#>  88 Joe   2004-04-28     0
#>  89 Joe   2004-04-29     0
#>  90 Joe   2004-04-30     0
#>  91 Joe   2004-05-01     0
#>  92 Joe   2004-05-02     0
#>  93 Joe   2004-05-03     0
#>  94 Joe   2004-05-04     0
#>  95 Joe   2004-05-05     0
#>  96 Joe   2004-05-06     0
#>  97 Joe   2004-05-07     0
#>  98 Joe   2004-05-08     0
#>  99 Joe   2004-05-09     0
#> 100 Joe   2004-05-10     0
#> 101 Joe   2004-05-11     0
#> 102 Joe   2004-05-12     0
#> 103 Joe   2004-05-13     0
#> 104 Joe   2004-05-14     0
#> 105 Joe   2004-05-15     0
#> 106 Joe   2004-05-16     0
#> 107 Joe   2004-05-17     0
#> 108 Joe   2004-05-18     0
#> 109 Joe   2004-05-19     0
#> 110 Joe   2004-05-20     0
#> 111 Joe   2004-05-21     0
#> 112 Joe   2004-05-22     0
#> 113 Joe   2004-05-23     0
#> 114 Joe   2004-05-24     0
#> 115 Joe   2004-05-25     0
#> 116 Joe   2004-05-26     0
#> 117 Joe   2004-05-27     0
#> 118 Joe   2004-05-28     0
#> 119 Joe   2004-05-29     0
#> 120 Joe   2004-05-30     0
#> 121 Joe   2004-05-31     0
#> 122 Joe   2004-06-01     0
#> 123 Joe   2004-06-02     0
#> 124 Joe   2004-06-03     0
#> 125 Joe   2004-06-04     0
#> 126 Joe   2004-06-05     0
#> 127 Joe   2004-06-06     0
#> 128 Joe   2004-06-07     0
#> 129 Joe   2004-06-08     0
#> 130 Joe   2004-06-09     0
#> 131 Joe   2004-06-10     0
#> 132 Joe   2004-06-11     0
#> 133 Joe   2004-06-12     0
#> 134 Joe   2004-06-13     0
#> 135 Joe   2004-06-14     0
#> 136 Joe   2004-06-15     0
#> 137 Joe   2004-06-16     0
#> 138 Joe   2004-06-17     0
#> 139 Joe   2004-06-18     0
#> 140 Joe   2004-06-19     0
#> 141 Joe   2004-06-20     0
#> 142 Joe   2004-06-21     0
#> 143 Joe   2004-06-22     0
#> 144 Joe   2004-06-23     0
#> 145 Joe   2004-06-24     0
#> 146 Joe   2004-06-25     0
#> 147 Joe   2004-06-26     0
#> 148 Joe   2004-06-27     0
#> 149 Joe   2004-06-28     0
#> 150 Joe   2004-06-29     0
#> 151 Joe   2004-06-30     0
#> 152 Joe   2004-07-01     0
#> 153 Joe   2004-07-02     0
#> 154 Joe   2004-07-03     0
#> 155 Joe   2004-07-04     0
#> 156 Joe   2004-07-05     0
#> 157 Joe   2004-07-06     0
#> 158 Joe   2004-07-07     0
#> 159 Joe   2004-07-08     0
#> 160 Joe   2004-07-09     0
#> 161 Joe   2004-07-10     0
#> 162 Joe   2004-07-11     0
#> 163 Joe   2004-07-12     0
#> 164 Joe   2004-07-13     0
#> 165 Joe   2004-07-14     0
#> 166 Joe   2004-07-15     0
#> 167 Joe   2004-07-16     0
#> 168 Joe   2004-07-17     0
#> 169 Joe   2004-07-18     0
#> 170 Joe   2004-07-19     0
#> 171 Joe   2004-07-20     0
#> 172 Joe   2004-07-21     0
#> 173 Joe   2004-07-22     0
#> 174 Joe   2004-07-23     0
#> 175 Joe   2004-07-24     0
#> 176 Joe   2004-07-25     0
#> 177 Joe   2004-07-26     0
#> 178 Joe   2004-07-27     0
#> 179 Joe   2004-07-28     0
#> 180 Joe   2004-07-29     0
#> 181 Joe   2004-07-30     0
#> 182 Joe   2004-07-31     0
#> 183 Joe   2004-08-01     0
#> 184 Joe   2004-08-02     0
#> 185 Joe   2004-08-03     0
#> 186 Joe   2004-08-04     0
#> 187 Joe   2004-08-05     0
#> 188 Joe   2004-08-06     0
#> 189 Joe   2004-08-07     0
#> 190 Joe   2004-08-08     0
#> 191 Joe   2004-08-09    10
#> 192 Joe   2004-08-10     0
#> 193 Joe   2004-08-11     0
#> 194 Joe   2004-08-12     0
#> 195 Joe   2004-08-13     9
#> 196 Joe   2004-08-14     0
#> 197 Joe   2004-08-15     0
#> 198 Joe   2004-08-16     0
#> 199 Joe   2004-08-17     0
#> 200 Joe   2004-08-18     0
#> 201 Joe   2004-08-19     0
#> 202 Joe   2004-08-20     0
#> 203 Joe   2004-08-21     0
#> 204 Joe   2004-08-22     0
#> 205 Joe   2004-08-23     0
#> 206 Joe   2004-08-24     0
#> 207 Joe   2004-08-25     0
#> 208 Joe   2004-08-26     0
#> 209 Joe   2004-08-27     0
#> 210 Joe   2004-08-28     0
#> 211 Joe   2004-08-29     0
#> 212 Joe   2004-08-30     0
#> 213 Joe   2004-08-31     0
#> 214 Joe   2004-09-01     0
#> 215 Joe   2004-09-02     0
#> 216 Joe   2004-09-03     0
#> 217 Joe   2004-09-04     0
#> 218 Joe   2004-09-05     0
#> 219 Joe   2004-09-06     0
#> 220 Joe   2004-09-07     0
#> 221 Joe   2004-09-08     0
#> 222 Joe   2004-09-09     0
#> 223 Joe   2004-09-10     0
#> 224 Joe   2004-09-11     0
#> 225 Joe   2004-09-12     0
#> 226 Joe   2004-09-13     0
#> 227 Joe   2004-09-14     0
#> 228 Joe   2004-09-15     0
#> 229 Joe   2004-09-16     0
#> 230 Joe   2004-09-17     0
#> 231 Joe   2004-09-18     0
#> 232 Joe   2004-09-19     0
#> 233 Joe   2004-09-20     0
#> 234 Joe   2004-09-21     0
#> 235 Joe   2004-09-22     0
#> 236 Joe   2004-09-23     0
#> 237 Joe   2004-09-24     0
#> 238 Joe   2004-09-25     0
#> 239 Joe   2004-09-26     0
#> 240 Joe   2004-09-27     0
#> 241 Joe   2004-09-28     0
#> 242 Joe   2004-09-29     0
#> 243 Joe   2004-09-30     0
#> 244 Joe   2004-10-01     0
#> 245 Joe   2004-10-02     0
#> 246 Joe   2004-10-03     0
#> 247 Joe   2004-10-04     0
#> 248 Joe   2004-10-05     0
#> 249 Joe   2004-10-06     0
#> 250 Joe   2004-10-07     0
#> 251 Joe   2004-10-08     0
#> 252 Joe   2004-10-09     0
#> 253 Joe   2004-10-10     0
#> 254 Joe   2004-10-11     0
#> 255 Joe   2004-10-12     0
#> 256 Joe   2004-10-13     0
#> 257 Joe   2004-10-14     0
#> 258 Joe   2004-10-15     0
#> 259 Joe   2004-10-16     0
#> 260 Joe   2004-10-17     0
#> 261 Joe   2004-10-18     0
#> 262 Joe   2004-10-19     0
#> 263 Joe   2004-10-20    15
#> 264 Joe   2004-10-21     0
#> 265 Joe   2004-10-22     0
#> 266 Joe   2004-10-23     0
#> 267 Joe   2004-10-24     0
#> 268 Joe   2004-10-25     0
#> 269 Joe   2004-10-26     0
#> 270 Joe   2004-10-27     0
#> 271 Joe   2004-10-28     0
#> 272 Joe   2004-10-29     0
#> 273 Joe   2004-10-30     0
#> 274 Joe   2004-10-31     0
#> 275 Joe   2004-11-01     0
#> 276 Joe   2004-11-02     1
#> 277 Joe   2004-11-03     0
#> 278 Joe   2004-11-04     0
#> 279 Joe   2004-11-05     0
#> 280 Joe   2004-11-06     0
#> 281 Joe   2004-11-07     0
#> 282 Joe   2004-11-08     0
#> 283 Joe   2004-11-09     0

The second approach will in general result in many more rows of data. If you want to keep a maximum of 7 rows after any date with non-zero hits, you can do the following:

df.filled2 = df.filled2 %>% 
  group_by(name) %>% 
  mutate(test=cumsum(hits > 0)) %>% 
  group_by(name, test) %>% 
  slice(1:8) %>%
  ungroup %>% 
  select(-test)
  • Related