I have a dataframe with multiple measuring dates for each subject in each row, and another dataframe with multiple visit dates for the same subject in each row (also including some NA's).
What I want is to extract the measuring dates that match the visit dates for a certain subject within a specific interval (say +/- 10 days from the visit date), tag the measuring dates that do not fall within this interval (e.g., with a 'FALSE' or -99), and keep the NA's as is.
A similar question was asked here, but it did not allow for measuring dates to fall within an interval around the visit date.
set.seed(1)
# Dataframe with measure dates
df1 <- rbind.data.frame(sort(sample(seq(as.Date("2018-01-01"), as.Date("2019-01-01"), by = "day"), 10)),
c(sort(sample(seq(as.Date("2018-06-01"), as.Date("2019-06-01"), by = "day"), 8)), NA, NA),
c(sort(sample(seq(as.Date("2019-06-01"), as.Date("2020-06-01"), by = "day"), 6)), rep(NA, 4)))
names(df1) <- paste("MEASUREDATE", 1:10, sep = "")
myfun <- function(x) as.Date(x, format = "%Y-%m-%d", origin = "1970-01-01")
df1 <- data.frame(lapply(df1, myfun))
df1
# Dataframe with visit dates
df2 <- rbind.data.frame(as.numeric(df1[1, 2:7]), as.numeric(c(df1[2, 4:6], NA, NA, NA)), as.numeric(c(df1[3, 1:2], rep(NA, 4))))
df2 <- data.frame(lapply(df2, myfun))
names(df2) <- paste("VISIT", 1:6, sep = "")
df2
So the first row of the new dataframe would be like this:
# New dataframe
df3 <- df1[1, ]
df3[1] <- FALSE
df3[9:10] <- FALSE
df3
Do you know how to tackle this problem? Any help is very much appreciated.
CodePudding user response:
Here is a data.table solution. In the second-to-last step, measurements without a matching visit get the visit date 1970-01-01 (NA is not an option, because it would be indistinguishable from the existing NA's, and the value has to be a date).
If the Date format is not necessary, you can switch to character and fill in any value you like...
library(data.table)
# set as data.table
setDT(df1); setDT(df2)
# add subject numbering
df1[, id := .I]
df2[, id := .I]
# melt to long format
df1.melt <- melt(setDT(df1), id.vars = "id")
df2.melt <- melt(setDT(df2), id.vars = "id")
# add margins around visit dates
df2.melt[, `:=`(mindate = value - 10, maxdate = value + 10)][]
# join visitdays within 10 days of measure (non-equi join)
df1.melt[df2.melt, visitdate := i.value, on = .(id, value >= mindate, value <= maxdate)]
# set missing visitdates to 1970-01-01 (keep date format)
df1.melt[!is.na(value) & is.na(visitdate), visitdate := as.Date("1970-01-01")]
# last step is to cast to wide again
dcast(df1.melt, id ~ variable, value.var = "visitdate")
# id MEASUREDATE1 MEASUREDATE2 MEASUREDATE3 MEASUREDATE4 MEASUREDATE5 MEASUREDATE6 MEASUREDATE7 MEASUREDATE8 MEASUREDATE9 MEASUREDATE10
# 1: 1 1970-01-01 2018-05-09 2018-06-16 2018-07-06 2018-10-04 2018-10-04 2018-10-26 2018-10-26 1970-01-01 1970-01-01
# 2: 2 1970-01-01 1970-01-01 1970-01-01 2018-11-12 2019-01-03 2019-01-03 1970-01-01 1970-01-01 <NA> <NA>
# 3: 3 2019-08-28 2020-03-15 2020-03-15 1970-01-01 1970-01-01 1970-01-01 <NA> <NA> <NA> <NA>
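The character variant mentioned above is not spelled out in the answer, so here is a minimal sketch of how it might look. The table `toy` and the tag `"no match"` are made up for illustration; `toy` only stands in for `df1.melt` after the join.

```r
library(data.table)

# hypothetical stand-in for df1.melt after the non-equi join
toy <- data.table(
  id        = c(1L, 1L, 1L),
  value     = as.Date(c("2018-01-05", "2018-05-10", NA)),
  visitdate = as.Date(c(NA, "2018-05-09", NA))
)

# replace the whole column by its character version so any tag can be stored
toy[, visitdate := as.character(visitdate)]
# tag measurements that exist but have no visit within the interval
toy[!is.na(value) & is.na(visitdate), visitdate := "no match"]

toy$visitdate
# "no match" "2018-05-09" NA
```

The last row stays NA because its measurement date is missing, which matches the behaviour of the Date version.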
CodePudding user response:
As Wimpel said, you cannot have a logical and a Date in the same column. So I will use 1970-01-01 as the FALSE value.
A solution using dplyr
library(dplyr)
# convert a row from a Date dataframe to a Date vector
convert_to_vector <- function(row){
return(row %>% t %>% as.Date)
}
# given a Date vector where columns 1:10 are measurement date and
# 11:16 visit dates, create a logical vector of length 10 where
# the value is TRUE if the corresponding measurement column
# is within 10 days of any of the visit dates
check_within_10d <- function(row){
return(sapply(row[1:10], function(x){abs(x-row[11:16])<=10}) %>% apply(2, any))
}
# temporary dataframe of logical values for all checks on all dates
df_lgl <- cbind(df1,df2) %>%
apply(1, function(row){check_within_10d(convert_to_vector(row))}) %>%
data.frame %>%
t
# create a result dataframe replacing logicals with corresponding dates
df3 <- df1
for(i in 1:ncol(df3)){
df3[,i] <- if_else(df_lgl[,i], df3[,i], as.Date("1970-01-01"))
}
Output
> df3
MEASUREDATE1 MEASUREDATE2 MEASUREDATE3 MEASUREDATE4 MEASUREDATE5 MEASUREDATE6 MEASUREDATE7 MEASUREDATE8 MEASUREDATE9 MEASUREDATE10
1 1970-01-01 2018-05-09 2018-06-16 2018-07-06 2018-09-27 2018-10-04 2018-10-26 2018-11-03 1970-01-01 1970-01-01
2 <NA> <NA> <NA> 2018-11-12 2018-12-30 2019-01-03 <NA> <NA> <NA> <NA>
3 2019-08-28 2020-03-15 2020-03-16 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
Some NA values appear because some visit dates are NA, so the check_within_10d function cannot tell whether one of the missing visit dates would be within 10 days of a measurement date.
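A small illustration of that behaviour, using made-up dates rather than the data above: when no known visit is in range and one visit is missing, `any()` returns NA unless the missing comparison is dropped.

```r
# hypothetical single measurement and two visits, one of them missing
measure <- as.Date("2018-12-20")
visits  <- as.Date(c("2018-11-12", NA))

# day differences as plain numbers: 38 and NA
within <- abs(as.numeric(measure) - as.numeric(visits)) <= 10

any(within)                 # NA: the missing visit might have been in range
any(within, na.rm = TRUE)   # FALSE: only the known visits are considered
```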
If you want to ignore the missing visit dates in your check, use
convert_to_vector <- function(row){
return(row %>% t %>% as.Date)
}
# changed function to any(na.rm=TRUE)
check_within_10d <- function(row){
return(sapply(row[1:10], function(x){abs(x-row[11:16])<=10}) %>% apply(2, function(x){any(x, na.rm=TRUE)}))
}
df_lgl <- cbind(df1,df2) %>%
apply(1, function(row){check_within_10d(convert_to_vector(row))}) %>%
data.frame %>%
t
# tag non-matching dates, then restore NA where the measurement date itself is missing
df3 <- df1
for(i in 1:ncol(df3)){
df3[,i] <- if_else(df_lgl[,i], df3[,i], as.Date("1970-01-01"))
df3[,i] <- if_else(is.na(df1[,i]), df1[,i], df3[,i])
}
Output
> df3
MEASUREDATE1 MEASUREDATE2 MEASUREDATE3 MEASUREDATE4 MEASUREDATE5 MEASUREDATE6 MEASUREDATE7 MEASUREDATE8 MEASUREDATE9 MEASUREDATE10
1 1970-01-01 2018-05-09 2018-06-16 2018-07-06 2018-09-27 2018-10-04 2018-10-26 2018-11-03 1970-01-01 1970-01-01
2 1970-01-01 1970-01-01 1970-01-01 2018-11-12 2018-12-30 2019-01-03 1970-01-01 1970-01-01 <NA> <NA>
3 2019-08-28 2020-03-15 2020-03-16 1970-01-01 1970-01-01 1970-01-01 <NA> <NA> <NA> <NA>
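For reference, the same per-subject check can be sketched in base R as a small helper. The function name `tag_measures` and its arguments `tol` and `tag` are made up for this illustration and are not part of either answer; it assumes one vector of measurement dates and one vector of visit dates for a single subject.

```r
# hypothetical helper: keep measurement dates that lie within `tol` days
# of any known visit, replace the others by `tag`, leave NA untouched
tag_measures <- function(measure, visits, tol = 10,
                         tag = as.Date("1970-01-01")) {
  visits <- as.numeric(visits[!is.na(visits)])   # ignore missing visits
  hit <- vapply(as.numeric(measure),
                function(m) any(abs(m - visits) <= tol),
                logical(1))
  out <- measure
  out[!is.na(measure) & !hit] <- tag             # tag only existing dates
  out
}

res <- tag_measures(as.Date(c("2018-01-05", "2018-05-10", NA)),
                    as.Date(c("2018-05-09", NA)))
res
# "1970-01-01" "2018-05-10" NA
```

Applied row by row to df1 and df2, this should give the same tagging as the second output above, though that has not been checked against the full example data.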