Home > database >  Counting occurance based on timeframes in a second data frame
Counting occurance based on timeframes in a second data frame

Time:11-17

this is doing my head in and wonder if I could get some advice.

I have 2 data frames

df1 = data.frame("ID" = c("a", "b", "a", "c", "a", "b"), 
                 "date" = as.Date(c("2020-1-1", "2020-1-2", "2020-1-5","2020-1-10", "2020-1-10", "2020-1-15"))

df2 = data.frame("ID" = c("a", "b", "a", "c"), 
                 "start" = as.Date(c("2019-12-30", "2020-1-1", "2020-1-5","2020-1-10")), 
                 "end" = as.Date(c("2020-1-3", "2020-1-2", "2020-1-12","2020-1-14")))

For each row in df2, I would like to count the number of observations in df1 in that occur within the stated start/end dates.

The result should look like this:

df3 = data.frame("ID" = c("a", "b", "a", "c"), 
                 "start" = as.Date(c("2019-12-30", "2020-1-1", "2020-1-5","2020-1-10")), 
                 "end" = as.Date(c("2020-1-3", "2020-1-2", "2020-1-12","2020-1-14")), 
                 "count" = c("1", "1", "2", "1"))

Most of the solutions I've seen involve some group_by or merge function, which gets messy when there are multiple occurances of the same ID in df2 that I want to preserve.

Any suggestions? Thanks!

CodePudding user response:

We can do it this way: First join by ID (using inner_join), then within an ifelse statement check if date is between start and end. The next grouping is essentiell to get OP's desired output and finally summarise:

library(dplyr)
inner_join(df1, df2, by = "ID") %>% 
  rowwise() %>%
  mutate(match = ifelse(between(date, start, end), 1, 0)) %>%
  group_by(ID, start, end) %>% 
  summarise(n = sum(match))
  ID    start      end            n
  <chr> <date>     <date>     <dbl>
1 a     2019-12-30 2020-01-03     1
2 a     2020-01-05 2020-01-12     2
3 b     2020-01-01 2020-01-02     1
4 c     2020-01-10 2020-01-14     1
  •  Tags:  
  • r
  • Related