How to compare two dataframes with dates, return matching dates within specific interval and tag non

Time:04-08

I have a dataframe with multiple measuring dates for each subject in each row, and another dataframe with multiple visit dates for the same subject in each row (also including some NAs).

What I want is to extract the measuring dates that match the visit dates for a given subject within a specific interval (say +/- 10 days from the visit date), tag the measuring dates that do not fall within this interval (e.g., with a 'FALSE' or -99), and keep the NAs as they are.
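To make the rule concrete, here is a minimal base R sketch for a single subject. The dates and the names `measure`, `visit`, `window`, `within_window` and `tagged` are made up for illustration and are not part of the example data below; the result is stored as character so that the "-99" tag and real dates can share one vector.

```r
# Hypothetical dates for one subject (not taken from df1/df2)
measure <- as.Date(c("2018-01-05", "2018-03-01", "2018-03-20", NA))
visit   <- as.Date(c("2018-03-05", NA))

window <- 10  # days allowed either side of a visit date

# TRUE when a measurement lies within `window` days of any non-missing visit
within_window <- vapply(
  seq_along(measure),
  function(i) any(abs(as.numeric(measure[i] - visit)) <= window, na.rm = TRUE),
  logical(1)
)

# Keep matching dates, tag the others with "-99", leave missing measurements NA
tagged <- ifelse(is.na(measure), NA_character_,
                 ifelse(within_window, format(measure), "-99"))
tagged
```

Only the second measurement (4 days before the visit) falls inside the window here; the first and third are tagged, and the missing measurement stays NA.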

A similar question was asked here, but did not allow for measuring dates to be within an interval period from visit date.

set.seed(1)

# Dataframe with measure dates
df1 <- rbind.data.frame(sort(sample(seq(as.Date("2018-01-01"), as.Date("2019-01-01"), by = "day"), 10)),
                        c(sort(sample(seq(as.Date("2018-06-01"), as.Date("2019-06-01"), by = "day"), 8)), NA, NA),
                        c(sort(sample(seq(as.Date("2019-06-01"), as.Date("2020-06-01"), by = "day"), 6)), rep(NA, 4)))
names(df1) <- paste("MEASUREDATE", 1:10, sep = "")

myfun <- function(x) as.Date(x, format = "%Y-%m-%d", origin = "1970-01-01")
df1 <- data.frame(lapply(df1, myfun))
df1

# Dataframe with visit dates
df2 <- rbind.data.frame(as.numeric(df1[1, 2:7]), as.numeric(c(df1[2, 4:6], NA, NA, NA)), as.numeric(c(df1[3, 1:2], rep(NA, 4))))
df2 <- data.frame(lapply(df2, myfun))
names(df2) <- paste("VISIT", 1:6, sep = "")
df2

So the first row of the new dataframe would look like this:

# New dataframe
df3 <- df1[1, ]
df3[1] <- FALSE
df3[9:10] <- FALSE
df3

Do you know how to tackle this problem? Any help is very much appreciated.

CodePudding user response:

Here is a data.table solution. In the second-to-last step, missing visit dates (measure dates with no visit within the interval) are set to 1970-01-01. NA cannot be used, because it would be indistinguishable from the existing NAs, and the value has to be a date. If the Date format is not necessary, you can convert the column to character and fill in any value you like.

library(data.table)
# set as data.table
setDT(df1); setDT(df2)
# add subject numbering
df1[, id := .I]
df2[, id := .I]
# melt to long format
df1.melt <- melt(setDT(df1), id.vars = "id")
df2.melt <- melt(setDT(df2), id.vars = "id")
# add margins around visit dates
df2.melt[, `:=`(mindate = value - 10, maxdate = value + 10)][]
# join visitdays within 10 days of measure (non-equi join)
df1.melt[df2.melt, visitdate := i.value, on = .(id, value >= mindate, value <= maxdate)]
# set missing visitdates to 1970-01-01, i.e. day 0 (keep date format)
df1.melt[!is.na(value) & is.na(visitdate), visitdate := 0]
# last step is to cast to wide again
dcast(df1.melt, id ~ variable, value.var = "visitdate")

#    id MEASUREDATE1 MEASUREDATE2 MEASUREDATE3 MEASUREDATE4 MEASUREDATE5 MEASUREDATE6 MEASUREDATE7 MEASUREDATE8 MEASUREDATE9 MEASUREDATE10
# 1:  1   1970-01-01   2018-05-09   2018-06-16   2018-07-06   2018-10-04   2018-10-04   2018-10-26   2018-10-26   1970-01-01    1970-01-01
# 2:  2   1970-01-01   1970-01-01   1970-01-01   2018-11-12   2019-01-03   2019-01-03   1970-01-01   1970-01-01         <NA>          <NA>
# 3:  3   2019-08-28   2020-03-15   2020-03-15   1970-01-01   1970-01-01   1970-01-01         <NA>         <NA>         <NA>          <NA>

CodePudding user response:

As Wimpel said, you cannot have a logical and a Date in the same column. So I will use 1970-01-01 as the FALSE value.
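As a small illustration of the sentinel idea (a sketch on made-up dates, with the hypothetical names `sentinel`, `dates`, `is_tagged` and `labels`): 1970-01-01 is day 0 in R's Date representation, so it can stand in for FALSE inside a Date column and still be told apart from real NAs afterwards.

```r
sentinel <- as.Date("1970-01-01")
as.numeric(sentinel)  # 0: Dates are stored as days since 1970-01-01

# Hypothetical result column: a real date, a tagged entry, a missing entry
dates <- as.Date(c("2018-05-09", "1970-01-01", NA))

# Tagged entries compare equal to the sentinel; missing entries stay NA
is_tagged <- dates == sentinel

# If a visible tag is preferred, convert to character afterwards
labels <- ifelse(!is.na(dates) & dates == sentinel, "FALSE", format(dates))
labels
```

The trade-off is that a genuine measurement on 1970-01-01 would be misread as a tag, which is unlikely for data like this but worth keeping in mind.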

A solution using dplyr

library(dplyr)
# convert a row from a Date dataframe to a Date vector
convert_to_vector <- function(row){
  return(row %>% t %>% as.Date)
}
# given a Date vector where columns 1:10 are measurement date and
# 11:16 visit dates, create a logical vector of length 10 where
# the value is TRUE if the corresponding measurement column
# is within 10 days of any of the visit dates
check_within_10d <- function(row){
  return(sapply(row[1:10], function(x){abs(x-row[11:16])<=10}) %>% apply(2, any))
}
# temporary dataframe of logical values for all checks on all dates
df_lgl <- cbind(df1,df2) %>% 
  apply(1, function(row){check_within_10d(convert_to_vector(row))}) %>% 
  data.frame %>% 
  t
# create a result dataframe replacing logicals with corresponding dates
df3 <- df1
for(i in 1:ncol(df3)){ 
  df3[,i] <- if_else(df_lgl[,i], df3[,i], as.Date("1970-01-01"))
}

Output

> df3
  MEASUREDATE1 MEASUREDATE2 MEASUREDATE3 MEASUREDATE4 MEASUREDATE5 MEASUREDATE6 MEASUREDATE7 MEASUREDATE8 MEASUREDATE9 MEASUREDATE10
1   1970-01-01   2018-05-09   2018-06-16   2018-07-06   2018-09-27   2018-10-04   2018-10-26   2018-11-03   1970-01-01    1970-01-01
2         <NA>         <NA>         <NA>   2018-11-12   2018-12-30   2019-01-03         <NA>         <NA>         <NA>          <NA>
3   2019-08-28   2020-03-15   2020-03-16         <NA>         <NA>         <NA>         <NA>         <NA>         <NA>          <NA>

Some values are NA because some visit dates are NA: the check_within_10d function cannot rule out that one of the missing visit dates is within 10 days of a measurement date.
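This follows from how any() treats missing values in base R. A short sketch, using a made-up logical vector `hits` to stand for the comparisons of one measurement date against a subject's visit dates:

```r
# One comparison failed, one visit date was missing
hits <- c(FALSE, NA)

unknown <- any(hits)                # NA: the missing visit might have matched
ignored <- any(hits, na.rm = TRUE)  # FALSE: missing visits are dropped first
certain <- any(c(TRUE, NA))         # TRUE: one definite match is enough

c(unknown = unknown, ignored = ignored, certain = certain)
```

An NA condition passed on to if_else() then yields NA in the result, which is why the unmatched dates of subjects with missing visits show up as NA rather than 1970-01-01.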

If you want to ignore the missing visit dates in your check, use

convert_to_vector <- function(row){
  return(row %>% t %>% as.Date)
}
# changed function to any(na.rm=TRUE)
check_within_10d <- function(row){
  return(sapply(row[1:10], function(x){abs(x-row[11:16])<=10}) %>% apply(2, function(x){any(x, na.rm=TRUE)}))
}
df_lgl <- cbind(df1,df2) %>% 
  apply(1, function(row){check_within_10d(convert_to_vector(row))}) %>% 
  data.frame %>% 
  t
# restore NA where the measurement date itself is missing
df3 <- df1
for(i in 1:ncol(df3)){ 
  df3[,i] <- if_else(df_lgl[,i], df3[,i], as.Date("1970-01-01"))
  df3[,i] <- if_else(is.na(df1[,i]), df1[,i], df3[,i])
}

Output

> df3
  MEASUREDATE1 MEASUREDATE2 MEASUREDATE3 MEASUREDATE4 MEASUREDATE5 MEASUREDATE6 MEASUREDATE7 MEASUREDATE8 MEASUREDATE9 MEASUREDATE10
1   1970-01-01   2018-05-09   2018-06-16   2018-07-06   2018-09-27   2018-10-04   2018-10-26   2018-11-03   1970-01-01    1970-01-01
2   1970-01-01   1970-01-01   1970-01-01   2018-11-12   2018-12-30   2019-01-03   1970-01-01   1970-01-01         <NA>          <NA>
3   2019-08-28   2020-03-15   2020-03-16   1970-01-01   1970-01-01   1970-01-01         <NA>         <NA>         <NA>          <NA>