Home > Enterprise >  Eliminate duplicates based on 2 dates in a dataframe
Eliminate duplicates based on 2 dates in a dataframe

Time:09-22

I have this sample dataframe:

df <- data.frame(ID = c("5","5","5","5","5","5" ,"5"    ,"5","5","5","5","14","14","14","14" ,"14","14"),
                 Date1= c("22/07/2014","22/07/2014","22/07/2014"
                           ,"22/07/2014"
                           ,"22/07/2014"
                           ,"22/07/2014"
                           ,"22/07/2014"
                           ,"22/07/2014"
                           ,"22/07/2014"
                           ,"22/07/2014"
                          ,"22/07/2014"
                          ,"08/11/2016" 
                         , "08/11/2016"
                         , "08/11/2016"
                         , "08/11/2016"
                         , "08/11/2016"
                         , "08/11/2016"),
                 Date2= c("01/01/2011"
                          ,"01/08/2011"
                          ,"01/12/2010"
                          ,"10/11/2015"
                          ,"22/07/2014"
                          ,"01/01/2013"
                          ,"23/04/2014"
                          ,"01/01/2006"
                          ,"01/01/2013"
                          ,"01/10/2012"
                          ,"01/08/2012"
                          ,"14/04/2015"
                          ,"01/10/2008"
                          ,"01/10/2008"
                          ,"14/05/2015"
                          ,"11/04/2015"
                          ,"05/10/2008"),
stringsAsFactors = F)

where I have each ID repeated several times. I need to get a dataframe with only 1 line per ID. As you can see, each patient has only one date in column df$date1, so the condition to select 1 row per patient would be: to pick the closest date between date 1 and date 2.

How could I do that?

Thansk

CodePudding user response:

Here a tidyverse approach. I created a column called diff_date, which is the absolute difference between Date1 and Date2. Than I filtered by each ID the minimum difference.

library(dplyr)
library(lubridate)
  
  df %>% 
  mutate(
    across(.cols = starts_with("Date"),.fns = dmy),
    diff_date = abs(as.numeric(difftime(Date1,Date2)))
    ) %>% 
  group_by(ID) %>% 
  filter(diff_date == min(diff_date))

# A tibble: 2 x 4
# Groups:   ID [2]
  ID    Date1      Date2      diff_date
  <chr> <date>     <date>         <dbl>
1 5     2014-07-22 2014-07-22         0
2 14    2016-11-08 2015-05-14  47001600

CodePudding user response:

Try the following base R code

unique(
  subset(
    df,
    !!ave(
      abs(as.integer(as.Date(Date2, format = "%d/%m/%Y") - as.Date(Date1, format = "%d/%m/%Y"))),
      ID,
      FUN = function(x) x == min(x)
    )
  )
)

and you will get

   ID      Date1      Date2
5   5 22/07/2014 22/07/2014
15 14 08/11/2016 14/05/2015

CodePudding user response:

Using base R, convert the 'Date' columns to Date class, order the data based on the 'ID' and the absolute difference between the Date columns, subset with duplicated i.e. first unique row on 'ID' column

df[2:3] <- lapply(df[2:3], as.Date, format = "%d/%m/%Y")
df1 <- df[with(df, order(ID, abs(as.numeric(Date1) - as.numeric(Date2)))),]
df1[!duplicated(df1$ID),]

-output

ID      Date1      Date2
15 14 2016-11-08 2015-05-14
5   5 2014-07-22 2014-07-22
  • Related