I'm working in R and I need to filter duplicates from a large dataset based on two conditions:
In principle, retain the record with the first date of diagnosis UNLESS a later date included a negative diagnosis - in which case retain the record with the negative diagnosis.
Here's a test dataframe where patients 12 and 13 are duplicated:
patient <- c('11', '12', '12', '13', '13', '14')
diagnosis <- c('positive', 'positive', 'negative', 'positive', 'positive', 'negative')
date <- c('01-01-2020', '02-02-2020', '03-03-2020', '04-04-2020', '05-05-2020', '06-06-2020') %>% as.Date("%d-%m-%y")
data <- data.frame(patient, diagnosis, date)
print(data)
patient diagnosis date
11 positive 2020-01-01
12 positive 2020-02-02
12 negative 2020-03-03
13 positive 2020-04-04
13 positive 2020-05-05
14 negative 2020-06-06
I need to get the following result where patient 12's negative diagnosis at a later date is retained and patient 13's first diagnosis was retained:
patient diagnosis date
11 positive 2020-01-01
12 negative 2020-03-03
13 positive 2020-04-04
14 negative 2020-06-06
My solution starts off as:
data <- group_by(data, patient) %>% #group by patient number
arrange(date, .by_group = TRUE) #arrange by date in ascending order
I'm not sure how to proceed from here. If I only needed the first date of diagnosis across duplicates, I could use slice()
and retain the first record. On the other hand, if I only needed to choose based on the diagnosis - I could use filter()
. I can't figure out how to apply both conditions.
Any help would be great! This was previously done manually in Excel, so figuring this out in R would be a huge step up.
CodePudding user response:
data %>%
arrange(diagnosis, date) %>%
group_by(patient) %>%
slice(1) %>%
ungroup()
# # A tibble: 4 x 3
# patient diagnosis date
# <chr> <chr> <date>
# 1 11 positive 2020-01-01
# 2 12 negative 2020-03-03
# 3 13 positive 2020-04-04
# 4 14 negative 2020-06-06
CodePudding user response:
With data.table
:
library(data.table)
dt <- as.data.table(data)
dt[order(patient, diagnosis, date), head(.SD, 1L), by = patient]
Output
patient diagnosis date
1: 11 positive 2020-01-01
2: 12 negative 2020-03-03
3: 13 positive 2020-04-04
4: 14 negative 2020-06-06