I'm working with a large data frame similar to the one below. I'd like to flag all observations that have an observation 30 days earlier by ID. I had originally been trying to do a fuzzyjoin to achieve this, but can't seem to nail down where I'm going wrong with {data.table}. Any tips?
library(tidyverse)
library(magrittr)
library(data.table)
df<-tibble(
date=sample(seq(as.Date('1999/01/01'), as.Date('1999/06/01'), by="day"), 300,replace=T),
id=sample(seq(1:3),300,replace=T),claim_id=1:300)
df%<>%data.table()
df_index<-df
df_readmit<-df
names(df_index)[c(1,3)]<-c("index_date","index_id")
names(df_readmit)[c(1,3)]<-c("readmit_date","readmit_id")
df_readmit[df_index,.(id,index_date,readmit_date,index_id,readmit_id),
on=.(id,readmit_date>index_date),nomatch=0]
CodePudding user response:
If order can be changed, then I suggest we just look at the diff
of the dates.
library(data.table)
setorder(df, date)
df[,.SD[c(TRUE, diff(date) > 30),], by = id]
# id date claim_id
# <int> <Date> <int>
# 1: 1 1999-01-01 231
# 2: 2 1999-01-02 284
# 3: 3 1999-01-03 78
In this case, because 100 days spanning 6 months is very unlikely to have a 30-day span untouched, none of the sample data has such an occurrence. However, perhaps the method works for you with your real data.