Home > database >  How to remove duplicates based on missing data in another column?
How to remove duplicates based on missing data in another column?

Time:07-19

I have a dataset that looks like this:

   Study_ID Recurrent_Status
1       100                1
2       100               NA
3       100               NA
4       200                1
5       300               NA
6       400                3
7       400               NA
8       500                3
9       500               NA
10      600               NA
11      700                1

I would like to remove any Study IDs that are duplicates, but keep the entry where there is data for 'recurrent status'. In other words, I want to remove every duplicate study ID where there is NA for 'recurrent status'. Recurrent status is either a value of 1 or 3 (or NA for some unduplicated patients).

My desired output would look something like this:

  Study_ID Recurrent_Status
1      100                1
2      200                1
3      300               NA
4      400                3
5      500                3
6      600               NA
7      700                1

I've tried to use this code, but it of course removes individuals with a recurrent status of 1 or 3, instead of retaining them.

full_data<-filter(full_data, !duplicated(MRN, fromLast = TRUE) | Recurrence_status !="1")
full_data<-filter(full_data, !duplicated(MRN, fromLast = TRUE) | Recurrence_status !="3")

When I try to remove the explanation mark, I get this error:

full_data<-filter(full_data, !duplicated(MRN, fromLast = TRUE) | Recurrence_status ="1")

Error: unexpected '=' in "full_data<-filter(full_data, !duplicated(MRN, fromLast = TRUE) | Recurrence_status ="

How can I go about doing this?

Reproducible data:

data<-data.frame(Study_ID=c("100","100","100","200","300","400","400","500","500","600","700"),Recurrent_Status=c("1","NA","NA","1","NA","3","NA","3","NA","NA","1"))

CodePudding user response:

We could arrange by the non-NA elements in 'Recurrent_Status' along with the first column and then use distinct

library(dplyr)
data %>% 
  arrange(Study_ID, is.na(Recurrent_Status)) %>%
  distinct(Study_ID, .keep_all = TRUE)

-output

  Study_ID Recurrent_Status
1      100                1
2      200                1
3      300               NA
4      400                3
5      500                3
6      600               NA
7      700                1

CodePudding user response:

Another dplyr option:

df <- read.table(text = "   Study_ID Recurrent_Status
1       100                1
2       100               NA
3       100               NA
4       200                1
5       300               NA
6       400                3
7       400               NA
8       500                3
9       500               NA
10      600               NA
11      700                1", header = TRUE)

library(dplyr)
df %>%
  group_by(Study_ID) %>%
  slice(which.max(!is.na(Recurrent_Status)))
#> # A tibble: 7 × 2
#> # Groups:   Study_ID [7]
#>   Study_ID Recurrent_Status
#>      <int>            <int>
#> 1      100                1
#> 2      200                1
#> 3      300               NA
#> 4      400                3
#> 5      500                3
#> 6      600               NA
#> 7      700                1

Created on 2022-07-18 by the reprex package (v2.0.1)

CodePudding user response:

Here we group then arrange and keep the first entry:

library(dplyr)

data %>% 
  group_by(Study_ID) %>% 
  arrange(Recurrent_Status, .by_group = TRUE) %>% 
  slice(1)
 Study_ID Recurrent_Status
  <chr>    <chr>           
1 100      1               
2 200      1               
3 300      NA              
4 400      3               
5 500      3               
6 600      NA              
7 700      1  

CodePudding user response:

data %>%
   group_by(Study_ID) %>%
   filter(!is.na(Recurrent_Status)|all(is.na(Recurrent_Status)))

# A tibble: 7 x 2
# Groups:   Study_ID [7]
  Study_ID Recurrent_Status
     <int>            <int>
1      100                1
2      200                1
3      300               NA
4      400                3
5      500                3
6      600               NA
7      700                1
  • Related