I need to create a variable that counts the number of observations that have occurred in the last 30 days for each id.
For example, imagine an observation that occurs on 1/2/2021 (d / m / y) for the id "a". If this observation is the first between 1/1/2021 and 1/2/2021 for the id "a" the variable must give 1. If it is the second, 2, etc.
Here is a larger example:
dat <- tibble::tribble(
~id, ~q, ~date,
"a", 1, "01/01/2021",
"a", 1, "01/01/2021",
"a", 1, "21/01/2021",
"a", 1, "21/01/2021",
"a", 1, "12/02/2021",
"a", 1, "12/02/2021",
"a", 1, "12/02/2021",
"a", 1, "12/02/2021",
"b", 1, "02/02/2021",
"b", 1, "02/02/2021",
"b", 1, "22/02/2021",
"b", 1, "22/02/2021",
"b", 1, "13/03/2021",
"b", 1, "13/03/2021",
"b", 1, "13/03/2021",
"b", 1, "13/03/2021")
dat$date <- lubridate::dmy(dat$date)
The result should be:
id q date newvar
a 1 01/01/2021 1
a 1 01/01/2021 2
a 1 21/01/2021 3
a 1 21/01/2021 4
a 1 12/02/2021 3
a 1 12/02/2021 4
a 1 12/02/2021 5
a 1 12/02/2021 6
b 1 02/02/2021 1
b 1 02/02/2021 2
b 1 22/02/2021 3
b 1 22/02/2021 4
b 1 13/03/2021 3
b 1 13/03/2021 4
b 1 13/03/2021 5
b 1 13/03/2021 6
Thank you very much.
CodePudding user response:
With sapply
and between
, count the number of observations prior to the current observation that are within 30 days.
library(lubridate)
library(dplyr)
dat %>%
group_by(id) %>%
mutate(newvar = sapply(seq(length(date)),
function(x) sum(between(date[1:x], date[x] - days(30), date[x]))))
# A tibble: 16 x 4
# Groups: id [2]
id q date newvar
<chr> <dbl> <date> <int>
1 a 1 2021-01-01 1
2 a 1 2021-01-01 2
3 a 1 2021-01-21 3
4 a 1 2021-01-21 4
5 a 1 2021-02-12 3
6 a 1 2021-02-12 4
7 a 1 2021-02-12 5
8 a 1 2021-02-12 6
9 b 1 2021-02-02 1
10 b 1 2021-02-02 2
11 b 1 2021-02-22 3
12 b 1 2021-02-22 4
13 b 1 2021-03-13 3
14 b 1 2021-03-13 4
15 b 1 2021-03-13 5
16 b 1 2021-03-13 6
CodePudding user response:
Left join dat to itself on the indicated condition grouping by the rows of the left hand data frame. We assume that you want a 30 day window ending at current row but if you wanted 30 days ago (31 day window) then change 29 to 30. Both give the same result for this data.
library(sqldf)
sqldf("select a.*, count(b.date) as newvar
from dat a left join dat b
on a.id = b.id and b.date between a.date - 29 and a.date and b.rowid <= a.rowid
group by a.rowid")
giving:
id q date newvar
1 a 1 2021-01-01 1
2 a 1 2021-01-01 2
3 a 1 2021-01-21 3
4 a 1 2021-01-21 4
5 a 1 2021-02-12 3
6 a 1 2021-02-12 4
7 a 1 2021-02-12 5
8 a 1 2021-02-12 6
9 b 1 2021-02-02 1
10 b 1 2021-02-02 2
11 b 1 2021-02-22 3
12 b 1 2021-02-22 4
13 b 1 2021-03-13 3
14 b 1 2021-03-13 4
15 b 1 2021-03-13 5
16 b 1 2021-03-13 6
To write it in a pipeline using [.] to denote the input data frame works.
dat %>% {
sqldf("select a.*, count(b.date) as newvar
from [.] a left join [.] b
on a.id = b.id and b.date between a.date - 29 and a.date and b.rowid <= a.rowid
group by a.rowid")
}
This runs roughly twice as fast as sapply on the data in the question.
library(microbenchmark)
microbenchmark(
sqldf = sqldf("select a.*, count(b.date) as newvar
from dat a left join dat b
on a.id = b.id and b.date between a.date - 29 and a.date and b.rowid <= a.rowid
group by a.rowid"),
sapply = dat %>%
group_by(id) %>%
mutate(newvar = sapply(seq(length(date)),
function(x) sum(between(date[1:x], date[x] - days(30), date[x]))))
)
giving:
Unit: milliseconds
expr min lq mean median uq max neval cld
sqldf 26.2768 26.77340 27.97039 27.0082 27.29515 63.1032 100 a
sapply 42.8800 43.69345 48.53094 44.1089 45.25275 285.4861 100 b