I have a dataset with ID numbers, dates, and test results, and need to create a final dataset where each row consists of a unique ID, date, and test result value. How can I find duplicates based on ID and date, and then keep rows based on a specific test result value?
df <- data.frame(id_number = c(1, 1, 2, 2, 3, 3, 3, 4),
date = c('2021-11-03', '2021-11-19', '2021-11-11', '2021-11-11', '2021-11-05', '2021-11-05', '2021-11-16', '2021-11-29'),
result = c(0,1,0,0,0,9,0,9) )
id_number date result
<dbl> <chr> <dbl>
1 1 2021-11-03 0
2 1 2021-11-19 1
3 2 2021-11-11 0
4 2 2021-11-11 0
5 3 2021-11-05 0
6 3 2021-11-05 9
7 3 2021-11-16 0
8 4 2021-11-29 9
df <- unique(df)
After using the unique function, I am still left with rows that have duplicate id_number and date, and different test results. Of these, I need to keep only the row that equals 0 or 1, and exclude any 9s.
In the example below, I'd want to keep row 4 and exclude row 5. I can't simply exclude rows where result = 9 because I want to keep those for any non-duplicate observations.
id_number date result
<dbl> <chr> <dbl>
1 1 2021-11-03 0
2 1 2021-11-19 1
3 2 2021-11-11 0
4 3 2021-11-05 0
5 3 2021-11-05 9
6 3 2021-11-16 0
7 4 2021-11-29 9
CodePudding user response:
You can do:
library(tidyverse)
df %>%
group_by(id_number, date) %>%
filter(!(result == 9 & row_number() > 1)) %>%
ungroup()
# A tibble: 6 x 3
id_number date result
<dbl> <chr> <dbl>
1 1 2021-11-03 0
2 1 2021-11-19 1
3 2 2021-11-11 0
4 3 2021-11-05 0
5 3 2021-11-16 0
6 4 2021-11-29 9
CodePudding user response:
For simplicity of understanding, use:
a) rows different than 9:
df <- subset(df,df$result != 9)
And then
b) Remove duplicated:
df <- subset(df, duplicated(df)==FALSE)
So if you want specific columns:
df <- subset(df,duplicated(df$result)==FALSE)
Or:
df <- subset(df,duplicated(df[ ,2:3])==FALSE)