How to compare two dataframes with dates, return matching dates within specific interval and tag non-matching dates

I have a dataframe with multiple measuring dates for each subject in each row, and another dataframe with multiple visit dates for the same subject in each row (also including some NAs).

What I want is to extract the measuring dates that match the visit dates for a certain subject within a specific interval (say +/- 10 days from the visit date), tag the measuring dates that do not fall within this interval (e.g. with 'FALSE' or -99), and keep the NAs as is.

A similar question was asked here, but it did not allow for measuring dates to fall within an interval around the visit date.

set.seed(1)

# Dataframe with measure dates
df1 <- rbind.data.frame(sort(sample(seq(as.Date("2018-01-01"), as.Date("2019-01-01"), by = "day"), 10)),
                        c(sort(sample(seq(as.Date("2018-06-01"), as.Date("2019-06-01"), by = "day"), 8)), NA, NA),
                        c(sort(sample(seq(as.Date("2019-06-01"), as.Date("2020-06-01"), by = "day"), 6)), rep(NA, 4)))
names(df1) <- paste("MEASUREDATE", 1:10, sep = "")

myfun <- function(x) as.Date(x, format = "%Y-%m-%d", origin = "1970-01-01")
df1 <- data.frame(lapply(df1, myfun))
df1

# Dataframe with visit dates
df2 <- rbind.data.frame(as.numeric(df1[1, 2:7]), as.numeric(c(df1[2, 4:6], NA, NA, NA)), as.numeric(c(df1[3, 1:2], rep(NA, 4))))
df2 <- data.frame(lapply(df2, myfun))
names(df2) <- paste("VISIT", 1:6, sep = "")
df2

So the first row of the new dataframe would look like this:

# New dataframe
df3 <- df1[1, ]
df3[1] <- FALSE
df3[9:10] <- FALSE
df3

Do you know how to tackle this problem? Any help is very much appreciated.

CodePudding user response：

Here is a data.table solution. In the second-to-last step, missing visit dates are set to 1970-01-01 (NA is not possible, because it would be mixed up with the NAs that are already there, and the value has to be a date). If the Date format is not necessary, you can switch to character and fill in any value you like.

library(data.table)
# set as data.table
setDT(df1); setDT(df2)
# add subject numbering
df1[, id := .I]
df2[, id := .I]
# melt to long format
df1.melt <- melt(setDT(df1), id.vars = "id")
df2.melt <- melt(setDT(df2), id.vars = "id")
# add margins around visit dates
df2.melt[, `:=`(mindate = value - 10, maxdate = value + 10)][]
# join visitdays within 10 days of measure (non-equi join)
df1.melt[df2.melt, visitdate := i.value, on = .(id, value >= mindate, value <= maxdate)]
# set missing visitdates to 1970-01-01 (numeric 0, keeps the date format)
df1.melt[!is.na(value) & is.na(visitdate), visitdate := 0]
# last step is to cast to wide again
dcast(df1.melt, id ~ variable, value.var = "visitdate")

#    id MEASUREDATE1 MEASUREDATE2 MEASUREDATE3 MEASUREDATE4 MEASUREDATE5 MEASUREDATE6 MEASUREDATE7 MEASUREDATE8 MEASUREDATE9 MEASUREDATE10
# 1:  1   1970-01-01   2018-05-09   2018-06-16   2018-07-06   2018-10-04   2018-10-04   2018-10-26   2018-10-26   1970-01-01    1970-01-01
# 2:  2   1970-01-01   1970-01-01   1970-01-01   2018-11-12   2019-01-03   2019-01-03   1970-01-01   1970-01-01         <NA>          <NA>
# 3:  3   2019-08-28   2020-03-15   2020-03-15   1970-01-01   1970-01-01   1970-01-01         <NA>         <NA>         <NA>          <NA>
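
Note that the table above holds the matched visit date, not the measuring date itself. If the measuring date is what you need, the same non-equi join can be used only as a flag. The following is a minimal, self-contained sketch on toy data, assuming data.table is installed; the names `measures`, `visits` and `tagged` are made up for illustration and are not part of the code above.

```r
library(data.table)

# toy long-format data for one subject (hypothetical names and values)
measures <- data.table(
  id       = c(1L, 1L, 1L),
  variable = c("M1", "M2", "M3"),
  value    = as.Date(c("2018-01-20", "2018-05-09", NA))
)
visits <- data.table(
  id    = c(1L, 1L),
  visit = as.Date(c("2018-05-12", NA))
)

# +/- 10 day window around each visit
visits[, `:=`(mindate = visit - 10, maxdate = visit + 10)]

# non-equi update join: visitdate is filled only where a visit window matches
measures[visits, visitdate := i.visit,
         on = .(id, value >= mindate, value <= maxdate)]

# keep the measuring date when it matched (or was NA to begin with),
# otherwise tag it with the 1970-01-01 sentinel
measures[, tagged := fifelse(is.na(value) | !is.na(visitdate),
                             value, as.Date("1970-01-01"))]
measures
```

Casting `tagged` instead of `visitdate` back to wide format would then give one column per measuring date, as in the question.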

CodePudding user response：

As Wimpel said, you cannot have a logical and a Date in the same column. So I will use 1970-01-01 as the FALSE value.
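
To show the sentinel idea on its own before the full solution, here is a small base R sketch for a single subject. The vectors `measure` and `visits` and their values are invented for illustration; the point is only that assigning a Date sentinel keeps the column a Date, and that missing measuring dates stay NA.

```r
# toy data for one subject (hypothetical values)
measure  <- as.Date(c("2018-01-20", "2018-05-09", NA))
visits   <- as.Date(c("2018-05-12", NA))
sentinel <- as.Date("1970-01-01")  # stands in for FALSE

# TRUE if the measuring date is within 10 days of any non-missing visit date
matched <- vapply(
  seq_along(measure),
  function(i) any(abs(measure[i] - visits) <= 10, na.rm = TRUE),
  logical(1)
)

# tag unmatched measuring dates, leave missing ones as NA
tagged <- measure
tagged[!is.na(measure) & !matched] <- sentinel

class(tagged)
tagged
```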

A solution using dplyr

library(dplyr)
# convert a row from a Date dataframe to a Date vector
convert_to_vector <- function(row){
  return(row %>% t %>% as.Date)
}
# given a Date vector where elements 1:10 are measurement dates and
# 11:16 are visit dates, create a logical vector of length 10 where
# the value is TRUE if the corresponding measurement date
# is within 10 days of any of the visit dates
check_within_10d <- function(row){
  return(sapply(row[1:10], function(x){abs(x-row[11:16])<=10}) %>% apply(2, any))
}
# temporary dataframe of logical values for all checks on all dates
df_lgl <- cbind(df1,df2) %>% 
  apply(1, function(row){check_within_10d(convert_to_vector(row))}) %>% 
  data.frame %>% 
  t
# create a result dataframe replacing logicals with corresponding dates
df3 <- df1
for(i in 1:ncol(df3)){ 
  df3[,i] <- if_else(df_lgl[,i], df3[,i], as.Date("1970-01-01"))
}

Output

> df3
  MEASUREDATE1 MEASUREDATE2 MEASUREDATE3 MEASUREDATE4 MEASUREDATE5 MEASUREDATE6 MEASUREDATE7 MEASUREDATE8 MEASUREDATE9 MEASUREDATE10
1   1970-01-01   2018-05-09   2018-06-16   2018-07-06   2018-09-27   2018-10-04   2018-10-26   2018-11-03   1970-01-01    1970-01-01
2         <NA>         <NA>         <NA>   2018-11-12   2018-12-30   2019-01-03         <NA>         <NA>         <NA>          <NA>
3   2019-08-28   2020-03-15   2020-03-16         <NA>         <NA>         <NA>         <NA>         <NA>         <NA>          <NA>

Some NA values are there because some visit dates are NA, so the check_within_10d function cannot tell whether one of the missing visit dates is within 10 days of a measurement date.
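
This comes from how any() treats NA in base R, which the short sketch below illustrates with made-up dates (the variable names are only for this example): a single TRUE is enough to return TRUE, but FALSE together with NA returns NA unless na.rm = TRUE is set.

```r
# one known visit date and one missing visit date (hypothetical values)
visits       <- as.Date(c("2018-05-09", NA))
measure_near <- as.Date("2018-05-12")  # 3 days from the known visit
measure_far  <- as.Date("2018-01-01")  # far from the known visit

within_near <- abs(measure_near - visits) <= 10  # TRUE, NA
within_far  <- abs(measure_far - visits) <= 10   # FALSE, NA

any(within_near)                # TRUE: one TRUE settles it
any(within_far)                 # NA: the missing visit could still match
any(within_far, na.rm = TRUE)   # FALSE: missing visits are ignored
```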

If you want to ignore the missing visit dates in your check, use

convert_to_vector <- function(row){
  return(row %>% t %>% as.Date)
}
# changed function to any(na.rm=TRUE)
check_within_10d <- function(row){
  return(sapply(row[1:10], function(x){abs(x-row[11:16])<=10}) %>% apply(2, function(x){any(x, na.rm=TRUE)}))
}
df_lgl <- cbind(df1,df2) %>% 
  apply(1, function(row){check_within_10d(convert_to_vector(row))}) %>% 
  data.frame %>% 
  t
# tag non-matching dates, then restore NA where the measurement date itself is missing
df3 <- df1
for(i in 1:ncol(df3)){ 
  df3[,i] <- if_else(df_lgl[,i], df3[,i], as.Date("1970-01-01"))
  df3[,i] <- if_else(is.na(df1[,i]), df1[,i], df3[,i])
}

Output

> df3
  MEASUREDATE1 MEASUREDATE2 MEASUREDATE3 MEASUREDATE4 MEASUREDATE5 MEASUREDATE6 MEASUREDATE7 MEASUREDATE8 MEASUREDATE9 MEASUREDATE10
1   1970-01-01   2018-05-09   2018-06-16   2018-07-06   2018-09-27   2018-10-04   2018-10-26   2018-11-03   1970-01-01    1970-01-01
2   1970-01-01   1970-01-01   1970-01-01   2018-11-12   2018-12-30   2019-01-03   1970-01-01   1970-01-01         <NA>          <NA>
3   2019-08-28   2020-03-15   2020-03-16   1970-01-01   1970-01-01   1970-01-01         <NA>         <NA>         <NA>          <NA>