I have this sample dataframe:
df <- data.frame(ID = c("5","5","5","5","5","5" ,"5" ,"5","5","5","5","14","14","14","14" ,"14","14"),
Date1= c("22/07/2014","22/07/2014","22/07/2014"
,"22/07/2014"
,"22/07/2014"
,"22/07/2014"
,"22/07/2014"
,"22/07/2014"
,"22/07/2014"
,"22/07/2014"
,"22/07/2014"
,"08/11/2016"
, "08/11/2016"
, "08/11/2016"
, "08/11/2016"
, "08/11/2016"
, "08/11/2016"),
Date2= c("01/01/2011"
,"01/08/2011"
,"01/12/2010"
,"10/11/2015"
,"22/07/2014"
,"01/01/2013"
,"23/04/2014"
,"01/01/2006"
,"01/01/2013"
,"01/10/2012"
,"01/08/2012"
,"14/04/2015"
,"01/10/2008"
,"01/10/2008"
,"14/05/2015"
,"11/04/2015"
,"05/10/2008"),
stringsAsFactors = F)
where I have each ID repeated several times. I need to get a dataframe with only 1 line per ID. As you can see, each patient has only one date in column df$date1, so the condition to select 1 row per patient would be: to pick the closest date between date 1 and date 2.
How could I do that?
Thansk
CodePudding user response:
Here a tidyverse
approach. I created a column called diff_date
, which is the absolute difference between Date1
and Date2
. Than I filtered by each ID
the minimum difference.
library(dplyr)
library(lubridate)
df %>%
mutate(
across(.cols = starts_with("Date"),.fns = dmy),
diff_date = abs(as.numeric(difftime(Date1,Date2)))
) %>%
group_by(ID) %>%
filter(diff_date == min(diff_date))
# A tibble: 2 x 4
# Groups: ID [2]
ID Date1 Date2 diff_date
<chr> <date> <date> <dbl>
1 5 2014-07-22 2014-07-22 0
2 14 2016-11-08 2015-05-14 47001600
CodePudding user response:
Try the following base R code
unique(
subset(
df,
!!ave(
abs(as.integer(as.Date(Date2, format = "%d/%m/%Y") - as.Date(Date1, format = "%d/%m/%Y"))),
ID,
FUN = function(x) x == min(x)
)
)
)
and you will get
ID Date1 Date2
5 5 22/07/2014 22/07/2014
15 14 08/11/2016 14/05/2015
CodePudding user response:
Using base R
, convert the 'Date' columns to Date
class, order
the data based on the 'ID' and the abs
olute difference between the Date columns, subset with duplicated
i.e. first unique row on 'ID' column
df[2:3] <- lapply(df[2:3], as.Date, format = "%d/%m/%Y")
df1 <- df[with(df, order(ID, abs(as.numeric(Date1) - as.numeric(Date2)))),]
df1[!duplicated(df1$ID),]
-output
ID Date1 Date2
15 14 2016-11-08 2015-05-14
5 5 2014-07-22 2014-07-22