Home > Back-end >  Count observations over rolling 30 day window
Count observations over rolling 30 day window

Time:02-18

I need to create a variable that counts the number of observations that have occurred in the last 30 days for each id.

For example, imagine an observation that occurs on 1/2/2021 (d / m / y) for the id "a". If this observation is the first between 1/1/2021 and 1/2/2021 for the id "a" the variable must give 1. If it is the second, 2, etc.

Here is a larger example:

dat <- tibble::tribble(
  ~id,  ~q,   ~date,
  "a",   1,   "01/01/2021",
  "a",   1,   "01/01/2021",
  "a",   1,   "21/01/2021",
  "a",   1,   "21/01/2021",
  "a",   1,   "12/02/2021",
  "a",   1,   "12/02/2021",
  "a",   1,   "12/02/2021",
  "a",   1,   "12/02/2021",
  "b",   1,   "02/02/2021",
  "b",   1,   "02/02/2021",
  "b",   1,   "22/02/2021",
  "b",   1,   "22/02/2021",
  "b",   1,   "13/03/2021",
  "b",   1,   "13/03/2021",
  "b",   1,   "13/03/2021",
  "b",   1,   "13/03/2021")
dat$date <- lubridate::dmy(dat$date)

The result should be:

id  q   date    newvar
a   1   01/01/2021  1
a   1   01/01/2021  2
a   1   21/01/2021  3
a   1   21/01/2021  4
a   1   12/02/2021  3
a   1   12/02/2021  4
a   1   12/02/2021  5
a   1   12/02/2021  6
b   1   02/02/2021  1
b   1   02/02/2021  2
b   1   22/02/2021  3
b   1   22/02/2021  4
b   1   13/03/2021  3
b   1   13/03/2021  4
b   1   13/03/2021  5
b   1   13/03/2021  6

Thank you very much.

CodePudding user response:

With sapply and between, count the number of observations prior to the current observation that are within 30 days.

library(lubridate)
library(dplyr)
dat %>% 
  group_by(id) %>% 
  mutate(newvar = sapply(seq(length(date)), 
                         function(x) sum(between(date[1:x], date[x] - days(30), date[x]))))

# A tibble: 16 x 4
# Groups:   id [2]
   id        q date       newvar
   <chr> <dbl> <date>      <int>
 1 a         1 2021-01-01      1
 2 a         1 2021-01-01      2
 3 a         1 2021-01-21      3
 4 a         1 2021-01-21      4
 5 a         1 2021-02-12      3
 6 a         1 2021-02-12      4
 7 a         1 2021-02-12      5
 8 a         1 2021-02-12      6
 9 b         1 2021-02-02      1
10 b         1 2021-02-02      2
11 b         1 2021-02-22      3
12 b         1 2021-02-22      4
13 b         1 2021-03-13      3
14 b         1 2021-03-13      4
15 b         1 2021-03-13      5
16 b         1 2021-03-13      6

CodePudding user response:

Left join dat to itself on the indicated condition grouping by the rows of the left hand data frame. We assume that you want a 30 day window ending at current row but if you wanted 30 days ago (31 day window) then change 29 to 30. Both give the same result for this data.

library(sqldf)

sqldf("select a.*, count(b.date) as newvar
  from dat a left join dat b
  on a.id = b.id and b.date between a.date - 29 and a.date and b.rowid <= a.rowid
  group by a.rowid")

giving:

   id q       date        newvar
1   a 1 2021-01-01             1
2   a 1 2021-01-01             2
3   a 1 2021-01-21             3
4   a 1 2021-01-21             4
5   a 1 2021-02-12             3
6   a 1 2021-02-12             4
7   a 1 2021-02-12             5
8   a 1 2021-02-12             6
9   b 1 2021-02-02             1
10  b 1 2021-02-02             2
11  b 1 2021-02-22             3
12  b 1 2021-02-22             4
13  b 1 2021-03-13             3
14  b 1 2021-03-13             4
15  b 1 2021-03-13             5
16  b 1 2021-03-13             6

To write it in a pipeline using [.] to denote the input data frame works.

dat %>% { 
  sqldf("select a.*, count(b.date) as newvar
    from [.] a left join [.] b
      on a.id = b.id and b.date between a.date - 29 and a.date and b.rowid <= a.rowid
    group by a.rowid")
  }

This runs roughly twice as fast as sapply on the data in the question.

library(microbenchmark)
microbenchmark(
  sqldf = sqldf("select a.*, count(b.date) as newvar
    from dat a left join dat b
    on a.id = b.id and b.date between a.date - 29 and a.date and b.rowid <= a.rowid
    group by a.rowid"),
  sapply = dat %>% 
    group_by(id) %>% 
    mutate(newvar = sapply(seq(length(date)), 
                         function(x) sum(between(date[1:x], date[x] - days(30), date[x]))))
)

giving:

Unit: milliseconds
   expr     min       lq     mean  median       uq      max neval cld
  sqldf 26.2768 26.77340 27.97039 27.0082 27.29515  63.1032   100  a 
 sapply 42.8800 43.69345 48.53094 44.1089 45.25275 285.4861   100   b
  • Related