Home > Back-end >  Find duplicate rows based on 2 columns and keep rows based on the value of a 3rd column in R
Find duplicate rows based on 2 columns and keep rows based on the value of a 3rd column in R

Time:03-08

I have a dataset with ID numbers, dates, and test results, and need to create a final dataset where each row consists of a unique ID, date, and test result value. How can I find duplicates based on ID and date, and then keep rows based on a specific test result value?

df <- data.frame(id_number = c(1, 1, 2, 2, 3, 3, 3, 4),
                 date = c('2021-11-03', '2021-11-19', '2021-11-11', '2021-11-11', '2021-11-05', '2021-11-05', '2021-11-16', '2021-11-29'),
                 result = c(0,1,0,0,0,9,0,9) )

  id_number date       result
      <dbl> <chr>       <dbl>
1         1 2021-11-03      0
2         1 2021-11-19      1
3         2 2021-11-11      0
4         2 2021-11-11      0
5         3 2021-11-05      0
6         3 2021-11-05      9
7         3 2021-11-16      0
8         4 2021-11-29      9

df <- unique(df)

After using the unique function, I am still left with rows that have duplicate id_number and date, and different test results. Of these, I need to keep only the row that equals 0 or 1, and exclude any 9s.

In the example below, I'd want to keep row 4 and exclude row 5. I can't simply exclude rows where result = 9 because I want to keep those for any non-duplicate observations.

  id_number date       result
      <dbl> <chr>       <dbl>
1         1 2021-11-03      0
2         1 2021-11-19      1
3         2 2021-11-11      0
4         3 2021-11-05      0
5         3 2021-11-05      9
6         3 2021-11-16      0
7         4 2021-11-29      9

CodePudding user response:

You can do:

library(tidyverse)

df %>%
  group_by(id_number, date) %>%
  filter(!(result == 9 & row_number() > 1)) %>%
  ungroup()

# A tibble: 6 x 3
  id_number date       result
      <dbl> <chr>       <dbl>
1         1 2021-11-03      0
2         1 2021-11-19      1
3         2 2021-11-11      0
4         3 2021-11-05      0
5         3 2021-11-16      0
6         4 2021-11-29      9

CodePudding user response:

For simplicity of understanding, use:

a) rows different than 9:

df <- subset(df,df$result != 9)

And then

b) Remove duplicated:

df <- subset(df, duplicated(df)==FALSE)

So if you want specific columns:

df <- subset(df,duplicated(df$result)==FALSE)

Or:

df <- subset(df,duplicated(df[ ,2:3])==FALSE)
  • Related