I have a dataset that looks like this:
Study_ID Recurrent_Status
1 100 1
2 100 NA
3 100 NA
4 200 1
5 300 NA
6 400 3
7 400 NA
8 500 3
9 500 NA
10 600 NA
11 700 1
I would like to remove any Study IDs that are duplicates, but keep the entry where there is data for 'recurrent status'. In other words, I want to remove every duplicate study ID where there is NA for 'recurrent status'. Recurrent status is either a value of 1 or 3 (or NA for some unduplicated patients).
My desired output would look something like this:
Study_ID Recurrent_Status
1 100 1
2 200 1
3 300 NA
4 400 3
5 500 3
6 600 NA
7 700 1
I've tried to use this code, but it of course removes individuals with a recurrent status of 1 or 3, instead of retaining them.
full_data<-filter(full_data, !duplicated(MRN, fromLast = TRUE) | Recurrence_status !="1")
full_data<-filter(full_data, !duplicated(MRN, fromLast = TRUE) | Recurrence_status !="3")
When I try to remove the explanation mark, I get this error:
full_data<-filter(full_data, !duplicated(MRN, fromLast = TRUE) | Recurrence_status ="1")
Error: unexpected '=' in "full_data<-filter(full_data, !duplicated(MRN, fromLast = TRUE) | Recurrence_status ="
How can I go about doing this?
Reproducible data:
data<-data.frame(Study_ID=c("100","100","100","200","300","400","400","500","500","600","700"),Recurrent_Status=c("1","NA","NA","1","NA","3","NA","3","NA","NA","1"))
CodePudding user response:
We could arrange
by the non-NA elements in 'Recurrent_Status' along with the first column and then use distinct
library(dplyr)
data %>%
arrange(Study_ID, is.na(Recurrent_Status)) %>%
distinct(Study_ID, .keep_all = TRUE)
-output
Study_ID Recurrent_Status
1 100 1
2 200 1
3 300 NA
4 400 3
5 500 3
6 600 NA
7 700 1
CodePudding user response:
Another dplyr
option:
df <- read.table(text = " Study_ID Recurrent_Status
1 100 1
2 100 NA
3 100 NA
4 200 1
5 300 NA
6 400 3
7 400 NA
8 500 3
9 500 NA
10 600 NA
11 700 1", header = TRUE)
library(dplyr)
df %>%
group_by(Study_ID) %>%
slice(which.max(!is.na(Recurrent_Status)))
#> # A tibble: 7 × 2
#> # Groups: Study_ID [7]
#> Study_ID Recurrent_Status
#> <int> <int>
#> 1 100 1
#> 2 200 1
#> 3 300 NA
#> 4 400 3
#> 5 500 3
#> 6 600 NA
#> 7 700 1
Created on 2022-07-18 by the reprex package (v2.0.1)
CodePudding user response:
Here we group then arrange and keep the first entry:
library(dplyr)
data %>%
group_by(Study_ID) %>%
arrange(Recurrent_Status, .by_group = TRUE) %>%
slice(1)
Study_ID Recurrent_Status
<chr> <chr>
1 100 1
2 200 1
3 300 NA
4 400 3
5 500 3
6 600 NA
7 700 1
CodePudding user response:
data %>%
group_by(Study_ID) %>%
filter(!is.na(Recurrent_Status)|all(is.na(Recurrent_Status)))
# A tibble: 7 x 2
# Groups: Study_ID [7]
Study_ID Recurrent_Status
<int> <int>
1 100 1
2 200 1
3 300 NA
4 400 3
5 500 3
6 600 NA
7 700 1