How to remove duplicate values based on another column?-CodePudding

I have a dataset that looks like this:

   Study_ID       Stage
1       100 Early Stage
2       100      Stable
3       200      Stable
4       300 Early Stage
5       400 Early Stage
6       400      Stable
7       500 Early Stage
8       500      Stable
9       600      Stable
10      700 Early Stage

I would like to remove any Study IDs that are duplicates, but keep the entry where the patient is 'stable'. In other words, I want to remove every duplicate study ID where the patient is 'Early Stage'.

My desired output would look something like this:

  Study_ID       Stage
1      100      Stable
2      200      Stable
3      300 Early Stage
4      400      Stable
5      500      Stable
6      600      Stable
7      700 Early Stage

How can I go about doing this?

Reproducible data:

data<-data.frame(Study_ID=c("100","100","200","300","400","400","500","500","600","700"),Stage=c("Early Stage","Stable","Stable","Early Stage","Early Stage","Stable","Early Stage","Stable","Stable","Early Stage"))

CodePudding user response：

You can use the following code:

data<-data.frame(Study_ID=c("100","100","200","300","400","400","500","500","600","700"),Stage=c("Early Stage","Stable","Stable","Early Stage","Early Stage","Stable","Early Stage","Stable","Stable","Early Stage"))

library(dplyr)
filter(data, !duplicated(Study_ID, fromLast = TRUE) | Stage !="Early Stage")
#>   Study_ID       Stage
#> 1      100      Stable
#> 2      200      Stable
#> 3      300 Early Stage
#> 4      400      Stable
#> 5      500      Stable
#> 6      600      Stable
#> 7      700 Early Stage

^{Created on 2022-06-30 by the reprex package (v2.0.1)}

CodePudding user response：

library(tidyverse)

data %>% 
  group_by(Study_ID) %>% 
  filter(!(n() > 1 & Stage != "Stable"))
#> # A tibble: 7 × 2
#> # Groups:   Study_ID [7]
#>   Study_ID Stage      
#>   <chr>    <chr>      
#> 1 100      Stable     
#> 2 200      Stable     
#> 3 300      Early Stage
#> 4 400      Stable     
#> 5 500      Stable     
#> 6 600      Stable     
#> 7 700      Early Stage

CodePudding user response：

Using by. I added one case with two "stable" to the data as possible special case.

by(dat, dat$Study_ID, \(x) {
  if (any(grepl('Stable', x$Stage))) {
    unique(x[x$Stage == 'Stable', ])
  } else {
    unique(x)
  }
}) |> do.call(what=rbind)
#     Study_ID       Stage
# 100      100      Stable
# 200      200      Stable
# 300      300 Early Stage
# 400      400      Stable
# 500      500      Stable
# 600      600      Stable
# 700      700 Early Stage

Or using Stage as.factor and ave the !duplicated max.

transform(dat, x=as.numeric(as.factor(Stage))) |> 
  subset(as.logical(ave(x, Study_ID, FUN=\(x) x == max(x) & !duplicated(x))) , -x)
#    Study_ID       Stage
# 2       100      Stable
# 3       200      Stable
# 4       300 Early Stage
# 6       400      Stable
# 8       500      Stable
# 9       600      Stable
# 11      700 Early Stage

Note, that this works because "Early" comes before "Stable" in the alphabet, otherwise use factor and define levels= order in the arguments.

Data:

dat <- structure(list(Study_ID = c(100L, 100L, 200L, 300L, 400L, 400L, 
500L, 500L, 600L, 600L, 700L), Stage = c("Early Stage", "Stable", 
"Stable", "Early Stage", "Early Stage", "Stable", "Early Stage", 
"Stable", "Stable", "Stable", "Early Stage")), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11"))