Home > Software engineering >  How do I find if a date in the first DF falls within the range of dates in another data frame?
How do I find if a date in the first DF falls within the range of dates in another data frame?

Time:12-18

this is a bit of a tricky one. I have two data frames and need help with creating a function or some sort of loop to help me with identifying if a value in data.frame x is between two column values in data.frame y.

So, for example:

data.frame x:

x <- structure(list(ID = c(1L, 1L, 3L,  2L, 2L), GroupID = c(45L,65L, 45L, 65L,45L), DateStart = c("2/11/2021", 
"2/14/2021", "2/10/2021, "2/16/2021","2/19/2021"), DateEnd = c("2/13/2021", 
"2/15/2021", "2/14/2021","2/18/2021", "2/22/2021")), 
class = "data.frame", row.names = c(NA, -4L))

x

  ID GroupID DateStart   DateEnd
1  1      45 2/11/2021 2/13/2021
2  1      65 2/14/2021 2/15/2021
3  3      45 2/10/2021 2/14/2021
4  2      65 2/16/2021 2/18/2021
5  2      45 2/19/2021 2/22/2021

and then y

y <- structure(list(ID = c(1L, 1L, 1L, 1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
2L 2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L, 
3L,3L,3L,3L,3L,3L,3L,3L,3L,3L,3L,3L,3L,3L), GroupID = c(45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,
45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,
45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L), 
DateStart = c("2/11/2021","2/11/2021","2/12/2021","2/12/2021","2/13/2021","2/13/2021","2/14/2021",
"2/14/2021","2/15/2021","2/15/2021","2/16/2021","2/16/2021","2/17/2021","2/17/2021", 
"2/11/2021","2/11/2021","2/12/2021","2/12/2021","2/13/2021","2/13/2021","2/14/2021",
"2/14/2021","2/15/2021","2/15/2021","2/16/2021","2/16/2021","2/17/2021","2/17/2021",
"2/11/2021","2/11/2021","2/12/2021","2/12/2021","2/13/2021","2/13/2021","2/14/2021",
"2/14/2021","2/15/2021","2/15/2021","2/16/2021","2/16/2021","2/17/2021","2/17/2021")), 
class = "data.frame", row.names = c(NA, -42L))

y

ID GroupID DateStart
1   1      45 2/11/2021
2   1      65 2/11/2021
3   1      45 2/12/2021
4   1      65 2/12/2021
5   1      45 2/13/2021
6   1      65 2/13/2021
7   1      45 2/14/2021
8   1      65 2/14/2021
9   1      45 2/15/2021
10  1      65 2/15/2021
11  1      45 2/16/2021
12  1      65 2/16/2021
13  1      45 2/17/2021
14  1      65 2/17/2021
15  2      45 2/11/2021
16  2      65 2/11/2021
17  2      45 2/12/2021
18  2      65 2/12/2021
19  2      45 2/13/2021
20  2      65 2/13/2021
21  2      45 2/14/2021
22  2      65 2/14/2021
23  2      45 2/15/2021
24  2      65 2/15/2021
25  2      45 2/16/2021
26  2      65 2/16/2021
27  2      45 2/17/2021
28  2      65 2/17/2021
29  3      45 2/11/2021
30  3      65 2/11/2021
31  3      45 2/12/2021
32  3      65 2/12/2021
33  3      45 2/13/2021
34  3      65 2/13/2021
35  3      45 2/14/2021
36  3      65 2/14/2021
37  3      45 2/15/2021
38  3      65 2/15/2021
39  3      45 2/16/2021
40  3      65 2/16/2021
41  3      45 2/17/2021
42  3      65 2/17/2021

What I'm hoping to end up with:

y
   ID GroupID DateStart Dummy
1   1      45 2/11/2021     1
2   1      65 2/11/2021    NA
3   1      45 2/12/2021     1
4   1      65 2/12/2021    NA
5   1      45 2/13/2021     1
6   1      65 2/13/2021    NA
7   1      45 2/14/2021    NA
8   1      65 2/14/2021     1
9   1      45 2/15/2021    NA
10  1      65 2/15/2021     1
11  1      45 2/16/2021    NA
12  1      65 2/16/2021    NA
13  1      45 2/17/2021    NA
14  1      65 2/17/2021    NA
15  2      45 2/11/2021    NA
16  2      65 2/11/2021    NA
17  2      45 2/12/2021    NA
18  2      65 2/12/2021    NA
19  2      45 2/13/2021    NA
20  2      65 2/13/2021    NA
21  2      45 2/14/2021    NA
22  2      65 2/14/2021    NA
23  2      45 2/15/2021    NA
24  2      65 2/15/2021    NA
25  2      45 2/16/2021    NA
26  2      65 2/16/2021     1
27  2      45 2/17/2021    NA
28  2      65 2/17/2021     1
29  3      45 2/11/2021     1
30  3      65 2/11/2021    NA
31  3      45 2/12/2021     1
32  3      65 2/12/2021    NA
33  3      45 2/13/2021     1
34  3      65 2/13/2021    NA
35  3      45 2/14/2021     1
36  3      65 2/14/2021    NA
37  3      45 2/15/2021    NA
38  3      65 2/15/2021    NA
39  3      45 2/16/2021    NA
40  3      65 2/16/2021    NA
41  3      45 2/17/2021    NA
42  3      65 2/17/2021    NA

What the resulting y data.frame gives me is a new 4th column where we have a 1 if a date falls between the DateStart and DateEnd in data.frame x, grouped by GroupID and ID. For example, for every ID=3 for GroupID=45 between the dates (inclusive of) 2/10/2021 and 2/14/2021 in y, I want the loop or function to specify 1 in the dummy column. For those dates in y that don't correspond to the conditions in x, I want an NA.

I essentially need, for every row in y, to read through every row in x, and give me a 1 if, conditional on groupID and ID, the date falls within the range specified.

My real dataset is pretty large (approx 2 million observations) so I'm looking for a quick way to do this as well.

I tried variations on this:R: Check if value from dataframe is within range other dataframe

But no dice.

Thanks in advance!

CodePudding user response:

  1. Do left_join
  2. Use mdy function to get class date
  3. group_by and operate row wise
  4. mutate Dumsy in a ifelse using between function:
library(dplyr)
library(lubridate)

left_join(y, x, by=c("ID", "GroupID")) %>% 
  mutate(across(starts_with("Date"), mdy)) %>% 
  group_by(ID, GroupID) %>% 
  rowwise() %>% 
  mutate(Dumsy = ifelse(between(DateStart.x, DateStart.y, DateEnd), 1, NA)) %>% 
  select(ID, GroupID, DateStart=DateStart.x, Dumsy)
     ID GroupID DateStart  Dumsy
   <int>   <int> <date>     <dbl>
 1     1      45 2021-02-11     1
 2     1      65 2021-02-11    NA
 3     1      45 2021-02-12     1
 4     1      65 2021-02-12    NA
 5     1      45 2021-02-13     1
 6     1      65 2021-02-13    NA
 7     1      45 2021-02-14    NA
 8     1      65 2021-02-14     1
 9     1      45 2021-02-15    NA
10     1      65 2021-02-15     1
# ... with 32 more rows

CodePudding user response:

What you need is a left join between y and x so that each row for y has the appropriate date cutoffs. After that, the dummy variable is created from a straightforward ifelse(). Here's a solution using dplyr

library(tidyverse)
y %>% 
  # Rename for clarity
  rename(date = DateStart) %>% 
  left_join(x, by = c("ID", "GroupID")) %>% 
  # Convert all the date columns to dates
  mutate(across(c(date, DateStart, DateEnd), as.Date, format = "%m/%d/%Y")) %>% 
  mutate(dummy = ifelse(date >= DateStart & date <= DateEnd, 1, NA))

Output (note I converted your data frames into tibbles):

# A tibble: 42 x 6
      ID GroupID date       DateStart  DateEnd    dummy
   <int>   <int> <date>     <date>     <date>     <dbl>
 1     1      45 2021-02-11 2021-02-11 2021-02-13     1
 2     1      65 2021-02-11 2021-02-14 2021-02-15    NA
 3     1      45 2021-02-12 2021-02-11 2021-02-13     1
 4     1      65 2021-02-12 2021-02-14 2021-02-15    NA
 5     1      45 2021-02-13 2021-02-11 2021-02-13     1
 6     1      65 2021-02-13 2021-02-14 2021-02-15    NA
 7     1      45 2021-02-14 2021-02-11 2021-02-13    NA
 8     1      65 2021-02-14 2021-02-14 2021-02-15     1
 9     1      45 2021-02-15 2021-02-11 2021-02-13    NA
10     1      65 2021-02-15 2021-02-14 2021-02-15     1
# ... with 32 more rows
  •  Tags:  
  • r
  • Related