I have a dataset that looks like this:
Study_ID Stage
1 100 Early Stage
2 100 Stable
3 200 Stable
4 300 Early Stage
5 400 Early Stage
6 400 Stable
7 500 Early Stage
8 500 Stable
9 600 Stable
10 700 Early Stage
I would like to remove any Study IDs that are duplicates, but keep the entry where the patient is 'stable'. In other words, I want to remove every duplicate study ID where the patient is 'Early Stage'.
My desired output would look something like this:
Study_ID Stage
1 100 Stable
2 200 Stable
3 300 Early Stage
4 400 Stable
5 500 Stable
6 600 Stable
7 700 Early Stage
How can I go about doing this?
Reproducible data:
data<-data.frame(Study_ID=c("100","100","200","300","400","400","500","500","600","700"),Stage=c("Early Stage","Stable","Stable","Early Stage","Early Stage","Stable","Early Stage","Stable","Stable","Early Stage"))
CodePudding user response:
You can use the following code:
data<-data.frame(Study_ID=c("100","100","200","300","400","400","500","500","600","700"),Stage=c("Early Stage","Stable","Stable","Early Stage","Early Stage","Stable","Early Stage","Stable","Stable","Early Stage"))
library(dplyr)
filter(data, !duplicated(Study_ID, fromLast = TRUE) | Stage !="Early Stage")
#> Study_ID Stage
#> 1 100 Stable
#> 2 200 Stable
#> 3 300 Early Stage
#> 4 400 Stable
#> 5 500 Stable
#> 6 600 Stable
#> 7 700 Early Stage
Created on 2022-06-30 by the reprex package (v2.0.1)
CodePudding user response:
library(tidyverse)
data %>%
group_by(Study_ID) %>%
filter(!(n() > 1 & Stage != "Stable"))
#> # A tibble: 7 × 2
#> # Groups: Study_ID [7]
#> Study_ID Stage
#> <chr> <chr>
#> 1 100 Stable
#> 2 200 Stable
#> 3 300 Early Stage
#> 4 400 Stable
#> 5 500 Stable
#> 6 600 Stable
#> 7 700 Early Stage
CodePudding user response:
Using by
. I added one case with two "stable" to the data as possible special case.
by(dat, dat$Study_ID, \(x) {
if (any(grepl('Stable', x$Stage))) {
unique(x[x$Stage == 'Stable', ])
} else {
unique(x)
}
}) |> do.call(what=rbind)
# Study_ID Stage
# 100 100 Stable
# 200 200 Stable
# 300 300 Early Stage
# 400 400 Stable
# 500 500 Stable
# 600 600 Stable
# 700 700 Early Stage
Or using Stage as.factor
and ave
the !duplicated
max
.
transform(dat, x=as.numeric(as.factor(Stage))) |>
subset(as.logical(ave(x, Study_ID, FUN=\(x) x == max(x) & !duplicated(x))) , -x)
# Study_ID Stage
# 2 100 Stable
# 3 200 Stable
# 4 300 Early Stage
# 6 400 Stable
# 8 500 Stable
# 9 600 Stable
# 11 700 Early Stage
Note, that this works because "Early" comes before "Stable" in the alphabet, otherwise use factor
and define levels=
order in the arguments.
Data:
dat <- structure(list(Study_ID = c(100L, 100L, 200L, 300L, 400L, 400L,
500L, 500L, 600L, 600L, 700L), Stage = c("Early Stage", "Stable",
"Stable", "Early Stage", "Early Stage", "Stable", "Early Stage",
"Stable", "Stable", "Stable", "Early Stage")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11"))