Home > OS >  R: filter duplicates/rows based on two conditions
R: filter duplicates/rows based on two conditions

Time:04-26

I'm working in R and I need to filter duplicates from a large dataset based on two conditions:

In principle, retain the record with the first date of diagnosis UNLESS a later date included a negative diagnosis - in which case retain the record with the negative diagnosis.

Here's a test dataframe where patients 12 and 13 are duplicated:

patient <- c('11', '12', '12', '13', '13', '14')
diagnosis <- c('positive', 'positive', 'negative', 'positive', 'positive', 'negative')
date <- c('01-01-2020', '02-02-2020', '03-03-2020', '04-04-2020', '05-05-2020', '06-06-2020') %>% as.Date("%d-%m-%y")

data <- data.frame(patient, diagnosis, date)

print(data)

patient diagnosis date
11      positive  2020-01-01
12      positive  2020-02-02
12      negative  2020-03-03
13      positive  2020-04-04
13      positive  2020-05-05
14      negative  2020-06-06

I need to get the following result where patient 12's negative diagnosis at a later date is retained and patient 13's first diagnosis was retained:

patient diagnosis date
11      positive  2020-01-01
12      negative  2020-03-03
13      positive  2020-04-04
14      negative  2020-06-06

My solution starts off as:

data <- group_by(data, patient) %>%     #group by patient number
        arrange(date, .by_group = TRUE) #arrange by date in ascending order

I'm not sure how to proceed from here. If I only needed the first date of diagnosis across duplicates, I could use slice() and retain the first record. On the other hand, if I only needed to choose based on the diagnosis - I could use filter(). I can't figure out how to apply both conditions.

Any help would be great! This was previously done manually in Excel, so figuring this out in R would be a huge step up.

CodePudding user response:

data %>%
  arrange(diagnosis, date) %>%
  group_by(patient) %>%
  slice(1) %>%
  ungroup()
# # A tibble: 4 x 3
#   patient diagnosis date      
#   <chr>   <chr>     <date>    
# 1 11      positive  2020-01-01
# 2 12      negative  2020-03-03
# 3 13      positive  2020-04-04
# 4 14      negative  2020-06-06

CodePudding user response:

With data.table:

library(data.table)
dt <- as.data.table(data)

dt[order(patient, diagnosis, date), head(.SD, 1L), by = patient]

Output

   patient diagnosis       date
1:      11  positive 2020-01-01
2:      12  negative 2020-03-03
3:      13  positive 2020-04-04
4:      14  negative 2020-06-06
  • Related