this is doing my head in and wonder if I could get some advice.
I have 2 data frames
df1 = data.frame("ID" = c("a", "b", "a", "c", "a", "b"),
"date" = as.Date(c("2020-1-1", "2020-1-2", "2020-1-5","2020-1-10", "2020-1-10", "2020-1-15"))
df2 = data.frame("ID" = c("a", "b", "a", "c"),
"start" = as.Date(c("2019-12-30", "2020-1-1", "2020-1-5","2020-1-10")),
"end" = as.Date(c("2020-1-3", "2020-1-2", "2020-1-12","2020-1-14")))
For each row in df2, I would like to count the number of observations in df1 in that occur within the stated start/end dates.
The result should look like this:
df3 = data.frame("ID" = c("a", "b", "a", "c"),
"start" = as.Date(c("2019-12-30", "2020-1-1", "2020-1-5","2020-1-10")),
"end" = as.Date(c("2020-1-3", "2020-1-2", "2020-1-12","2020-1-14")),
"count" = c("1", "1", "2", "1"))
Most of the solutions I've seen involve some group_by or merge function, which gets messy when there are multiple occurances of the same ID in df2 that I want to preserve.
Any suggestions? Thanks!
CodePudding user response:
We can do it this way:
First join by ID (using inner_join
), then within an ifelse statement check if date is between start and end.
The next grouping is essentiell to get OP's desired output and finally summarise:
library(dplyr)
inner_join(df1, df2, by = "ID") %>%
rowwise() %>%
mutate(match = ifelse(between(date, start, end), 1, 0)) %>%
group_by(ID, start, end) %>%
summarise(n = sum(match))
ID start end n
<chr> <date> <date> <dbl>
1 a 2019-12-30 2020-01-03 1
2 a 2020-01-05 2020-01-12 2
3 b 2020-01-01 2020-01-02 1
4 c 2020-01-10 2020-01-14 1