I would like to select observations by individual time ranges - my simulated data may explain, what I need:
There are two data frames:
- event_data: contains individuals (persons) who experience an event on a certain date. Every event is a new row, but an individual (identified by "event_person_id" may also experience more than one event (maybe 2, 3, 4, 5,...).
- visit_data: contains all personal contacts with individuals. Every visit is assigned to an individual ("visit_person_id") and has its own visit_date. There are far more visits than events.
Visit_data contains more data than I need in the end, because I only want to select those rows in visit_data, which occur in the time range between "event_date" plus two years.
For example: Person_id 1 has two events - the first on March 21st 2014 and the second on August 8th 2018, so I would like to select only those rows in visit_data which date between March 21st 2014 and March 21st 2016 as well as between August 8th 2018 and August 8th 2020.
My idea was to join event_data and visit_data by person_id - so I have a new data frame which contains visit_date as well as event_date and then I could select the relevant information, but my method is not completely correct as it (randomly?) assigns the first or the second event_date to the visit rows, f.e. person_id 5 has a visit on 1988-12-15 but has the event_date of 2019-09-03 instead of 1988-03-04 and so this visit is ignored, because 2019 is not in the range of 1988 to 1990.
If my idea of joining the tables is completely wrong or if there is a smarter way I would like to learn this better method!
Here is my code:
library(dplyr)
library(lubridate)
set.seed(123)
event_data <- data.frame(event_person_id = seq(1, 100, 1),
event_date = sample(seq(as.Date('1980/01/01'), as.Date('2010/12/31'), by="day"), 100),
age = round(runif(100, min = 1, max = 80)),
bmi = round(runif(100, min = 19, max = 30)),
amount = round(runif(100, min = 10, max = 10000)),
stringsAsFactors = FALSE)
event_data2 <- data.frame(event_person_id = seq(1, 10, 1),
event_date = sample(seq(as.Date('2011/01/01'), as.Date('2020/12/31'), by="day"), 10),
age = round(runif(10, min = 1, max = 80)),
bmi = round(runif(10, min = 19, max = 30)),
amount = round(runif(10, min = 10, max = 10000)),
stringsAsFactors = FALSE)
event_data_total <- rbind(event_data, event_data2)
visit_data <- data.frame(visit_person_id = round(runif(10000, min = 1, max = 100)),
visit_id = seq(1, 10000, 1),
visit_date = sample(seq(as.Date('1980/01/01'), as.Date('2020/12/31'), by="day"), 10000),
var1 = round(runif(10000, min = 1, max = 500)),
var2 = round(runif(10000, min = 1, max = 1000)),
var3 = round(runif(10000, min = 1, max = 9000)),
stringsAsFactors = FALSE)
data_joined <-
dplyr::inner_join(visit_data, event_data_total, by = c('visit_person_id' = 'event_person_id')) %>%
arrange(visit_date)
data_joined_final <- data_joined %>%
filter(visit_date > event_date & visit_date < (event_date %m % years(2)))
I would really be thankful for any help :)
CodePudding user response:
Fairly new at helping so have mercy ;) But as far as I understand you, a solution could come from the "data.table" "foverlaps". Which joins by overlapping dates and ID. I'm not 100% sure how exactly you would like your final data, but here comes a suggestion, that you can perhaps modify to your specific need.
library(dplyr)
library(lubridate)
library(data.table)
set.seed(123)
event_data <- data.frame(event_person_id = seq(1, 100, 1),
event_date = sample(seq(as.Date('1980/01/01'), as.Date('2010/12/31'), by="day"), 100),
age = round(runif(100, min = 1, max = 80)),
bmi = round(runif(100, min = 19, max = 30)),
amount = round(runif(100, min = 10, max = 10000)),
stringsAsFactors = FALSE)
event_data2 <- data.frame(event_person_id = seq(1, 10, 1),
event_date = sample(seq(as.Date('2011/01/01'), as.Date('2020/12/31'), by="day"), 10),
age = round(runif(10, min = 1, max = 80)),
bmi = round(runif(10, min = 19, max = 30)),
amount = round(runif(10, min = 10, max = 10000)),
stringsAsFactors = FALSE)
event_data_total <- rbind(event_data, event_data2)
visit_data <- data.frame(visit_person_id = round(runif(10000, min = 1, max = 100)),
visit_id = seq(1, 10000, 1),
visit_date = sample(seq(as.Date('1980/01/01'), as.Date('2020/12/31'), by="day"), 10000),
var1 = round(runif(10000, min = 1, max = 500)),
var2 = round(runif(10000, min = 1, max = 1000)),
var3 = round(runif(10000, min = 1, max = 9000)),
stringsAsFactors = FALSE)
#create the end dates 2 years
event_data_total$end_date <- event_data_total$event_date years(2)
#set as data.table
DT1 <- data.table(visit_data)
DT2 <- data.table(event_data_total)
#set joining keys
setkey(DT2, event_person_id, event_date, end_date)
#create dublicate columns with the same names, used for foverlaps
DT1[, c("event_date", "end_date") := visit_date]
DT1[, c("event_person_id") := visit_person_id]
#join data
data_joined<-foverlaps(DT1, DT2)
# now you should be able to sort it with e.g.
dat <- data_joined[complete.cases(data_joined), ]