this is a bit of a tricky one. I have two data frames and need help with creating a function or some sort of loop to help me with identifying if a value in data.frame x is between two column values in data.frame y.
So, for example:
data.frame x:
x <- structure(list(ID = c(1L, 1L, 3L, 2L, 2L), GroupID = c(45L,65L, 45L, 65L,45L), DateStart = c("2/11/2021",
"2/14/2021", "2/10/2021, "2/16/2021","2/19/2021"), DateEnd = c("2/13/2021",
"2/15/2021", "2/14/2021","2/18/2021", "2/22/2021")),
class = "data.frame", row.names = c(NA, -4L))
x
ID GroupID DateStart DateEnd
1 1 45 2/11/2021 2/13/2021
2 1 65 2/14/2021 2/15/2021
3 3 45 2/10/2021 2/14/2021
4 2 65 2/16/2021 2/18/2021
5 2 45 2/19/2021 2/22/2021
and then y
y <- structure(list(ID = c(1L, 1L, 1L, 1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
2L 2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,
3L,3L,3L,3L,3L,3L,3L,3L,3L,3L,3L,3L,3L,3L), GroupID = c(45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,
45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,
45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L),
DateStart = c("2/11/2021","2/11/2021","2/12/2021","2/12/2021","2/13/2021","2/13/2021","2/14/2021",
"2/14/2021","2/15/2021","2/15/2021","2/16/2021","2/16/2021","2/17/2021","2/17/2021",
"2/11/2021","2/11/2021","2/12/2021","2/12/2021","2/13/2021","2/13/2021","2/14/2021",
"2/14/2021","2/15/2021","2/15/2021","2/16/2021","2/16/2021","2/17/2021","2/17/2021",
"2/11/2021","2/11/2021","2/12/2021","2/12/2021","2/13/2021","2/13/2021","2/14/2021",
"2/14/2021","2/15/2021","2/15/2021","2/16/2021","2/16/2021","2/17/2021","2/17/2021")),
class = "data.frame", row.names = c(NA, -42L))
y
ID GroupID DateStart
1 1 45 2/11/2021
2 1 65 2/11/2021
3 1 45 2/12/2021
4 1 65 2/12/2021
5 1 45 2/13/2021
6 1 65 2/13/2021
7 1 45 2/14/2021
8 1 65 2/14/2021
9 1 45 2/15/2021
10 1 65 2/15/2021
11 1 45 2/16/2021
12 1 65 2/16/2021
13 1 45 2/17/2021
14 1 65 2/17/2021
15 2 45 2/11/2021
16 2 65 2/11/2021
17 2 45 2/12/2021
18 2 65 2/12/2021
19 2 45 2/13/2021
20 2 65 2/13/2021
21 2 45 2/14/2021
22 2 65 2/14/2021
23 2 45 2/15/2021
24 2 65 2/15/2021
25 2 45 2/16/2021
26 2 65 2/16/2021
27 2 45 2/17/2021
28 2 65 2/17/2021
29 3 45 2/11/2021
30 3 65 2/11/2021
31 3 45 2/12/2021
32 3 65 2/12/2021
33 3 45 2/13/2021
34 3 65 2/13/2021
35 3 45 2/14/2021
36 3 65 2/14/2021
37 3 45 2/15/2021
38 3 65 2/15/2021
39 3 45 2/16/2021
40 3 65 2/16/2021
41 3 45 2/17/2021
42 3 65 2/17/2021
What I'm hoping to end up with:
y
ID GroupID DateStart Dummy
1 1 45 2/11/2021 1
2 1 65 2/11/2021 NA
3 1 45 2/12/2021 1
4 1 65 2/12/2021 NA
5 1 45 2/13/2021 1
6 1 65 2/13/2021 NA
7 1 45 2/14/2021 NA
8 1 65 2/14/2021 1
9 1 45 2/15/2021 NA
10 1 65 2/15/2021 1
11 1 45 2/16/2021 NA
12 1 65 2/16/2021 NA
13 1 45 2/17/2021 NA
14 1 65 2/17/2021 NA
15 2 45 2/11/2021 NA
16 2 65 2/11/2021 NA
17 2 45 2/12/2021 NA
18 2 65 2/12/2021 NA
19 2 45 2/13/2021 NA
20 2 65 2/13/2021 NA
21 2 45 2/14/2021 NA
22 2 65 2/14/2021 NA
23 2 45 2/15/2021 NA
24 2 65 2/15/2021 NA
25 2 45 2/16/2021 NA
26 2 65 2/16/2021 1
27 2 45 2/17/2021 NA
28 2 65 2/17/2021 1
29 3 45 2/11/2021 1
30 3 65 2/11/2021 NA
31 3 45 2/12/2021 1
32 3 65 2/12/2021 NA
33 3 45 2/13/2021 1
34 3 65 2/13/2021 NA
35 3 45 2/14/2021 1
36 3 65 2/14/2021 NA
37 3 45 2/15/2021 NA
38 3 65 2/15/2021 NA
39 3 45 2/16/2021 NA
40 3 65 2/16/2021 NA
41 3 45 2/17/2021 NA
42 3 65 2/17/2021 NA
What the resulting y data.frame gives me is a new 4th column where we have a 1 if a date falls between the DateStart and DateEnd in data.frame x, grouped by GroupID and ID. For example, for every ID=3 for GroupID=45 between the dates (inclusive of) 2/10/2021 and 2/14/2021 in y, I want the loop or function to specify 1 in the dummy column. For those dates in y that don't correspond to the conditions in x, I want an NA.
I essentially need, for every row in y, to read through every row in x, and give me a 1 if, conditional on groupID and ID, the date falls within the range specified.
My real dataset is pretty large (approx 2 million observations) so I'm looking for a quick way to do this as well.
I tried variations on this:R: Check if value from dataframe is within range other dataframe
But no dice.
Thanks in advance!
CodePudding user response:
- Do
left_join
- Use
mdy
function to get class date group_by
and operate row wisemutate
Dumsy
in aifelse
usingbetween
function:
library(dplyr)
library(lubridate)
left_join(y, x, by=c("ID", "GroupID")) %>%
mutate(across(starts_with("Date"), mdy)) %>%
group_by(ID, GroupID) %>%
rowwise() %>%
mutate(Dumsy = ifelse(between(DateStart.x, DateStart.y, DateEnd), 1, NA)) %>%
select(ID, GroupID, DateStart=DateStart.x, Dumsy)
ID GroupID DateStart Dumsy
<int> <int> <date> <dbl>
1 1 45 2021-02-11 1
2 1 65 2021-02-11 NA
3 1 45 2021-02-12 1
4 1 65 2021-02-12 NA
5 1 45 2021-02-13 1
6 1 65 2021-02-13 NA
7 1 45 2021-02-14 NA
8 1 65 2021-02-14 1
9 1 45 2021-02-15 NA
10 1 65 2021-02-15 1
# ... with 32 more rows
CodePudding user response:
What you need is a left join between y
and x
so that each row for y
has the appropriate date cutoffs. After that, the dummy variable is created from a straightforward ifelse()
. Here's a solution using dplyr
library(tidyverse)
y %>%
# Rename for clarity
rename(date = DateStart) %>%
left_join(x, by = c("ID", "GroupID")) %>%
# Convert all the date columns to dates
mutate(across(c(date, DateStart, DateEnd), as.Date, format = "%m/%d/%Y")) %>%
mutate(dummy = ifelse(date >= DateStart & date <= DateEnd, 1, NA))
Output (note I converted your data frames into tibbles):
# A tibble: 42 x 6
ID GroupID date DateStart DateEnd dummy
<int> <int> <date> <date> <date> <dbl>
1 1 45 2021-02-11 2021-02-11 2021-02-13 1
2 1 65 2021-02-11 2021-02-14 2021-02-15 NA
3 1 45 2021-02-12 2021-02-11 2021-02-13 1
4 1 65 2021-02-12 2021-02-14 2021-02-15 NA
5 1 45 2021-02-13 2021-02-11 2021-02-13 1
6 1 65 2021-02-13 2021-02-14 2021-02-15 NA
7 1 45 2021-02-14 2021-02-11 2021-02-13 NA
8 1 65 2021-02-14 2021-02-14 2021-02-15 1
9 1 45 2021-02-15 2021-02-11 2021-02-13 NA
10 1 65 2021-02-15 2021-02-14 2021-02-15 1
# ... with 32 more rows