R: Delete ID if any of the rows within ID contain specific strings (or multiple partial strings) in

Time:02-10

I would like to delete all rows of an ID if any row within that ID contains certain strings (A or D) in dx. Here is my data frame:

id   time   dx
1     1     C
1     2     B
2     1     A
2     2     C
2     3     B
3     1     D

I would like the following:

id    time  dx
 1     1     C
 1     2     B

Based on the earlier post regarding this (Delete rows containing specific strings in R), I tried d %>% filter(!grepl('A|D', dx)). However, it only deletes the rows that contain A or D, not the whole IDs. I'd appreciate any help!

Update: All the answers below worked well for the data above. Thank you all! Note that I simplified my data frame for this post; I later realized that I actually needed R code to delete the IDs containing certain partial strings (e.g., A or B0) from the data frame below. I was able to achieve this by modifying r2evans' first answer: d %>% group_by(id) %>% filter(!any(str_detect(dx, "A|B0"))) %>% ungroup(). (str_detect comes from the stringr package.) I have included the note here in case someone needs it, and I would appreciate any additional suggestions.

Data frame:

id time dx
1   1   C01
1   2   B1
2   1   A34
2   2   C01
2   3   B1
3   1   B01X

The results I wanted:

id time dx
1   1   C01
1   2   B1
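
For the partial-string case, one possible base R sketch (an alternative to the dplyr/stringr pipeline mentioned in the update, not taken from any of the answers) first collects the IDs that have at least one matching row with grepl and then drops every row of those IDs. The names d, bad_ids, and result are only illustrative; the pattern "A|B0" is the one from the update.

```r
# Second data frame from the question, rebuilt by hand
d <- data.frame(
  id   = c(1L, 1L, 2L, 2L, 2L, 3L),
  time = c(1L, 2L, 1L, 2L, 3L, 1L),
  dx   = c("C01", "B1", "A34", "C01", "B1", "B01X"),
  stringsAsFactors = FALSE
)

# IDs with at least one dx containing "A" or "B0" anywhere in the string
bad_ids <- unique(d$id[grepl("A|B0", d$dx)])

# Keep only the rows whose ID is not among the flagged ones
result <- d[!d$id %in% bad_ids, ]
print(result)
```

Note that "B1" does not match "B0", so ID 1 is kept, while "B01X" does match and removes ID 3.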

CodePudding user response:

grep is the wrong tool for this. Based on your question and sample data, I think %in% is the better way to go. Combine that with dplyr::group_by and an any(.) condition, and we get the desired result:

dplyr

dat %>%
  group_by(id) %>%
  filter(!any(dx %in% c("A", "D"))) %>%
  ungroup()
# # A tibble: 2 x 3
#      id  time dx   
#   <int> <int> <chr>
# 1     1     1 C    
# 2     1     2 B    
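
To illustrate why %in% and grepl can give different answers, here is a small sketch on a made-up vector. The values "A34" and "BAD" are hypothetical and only serve to show the contrast between exact matching and substring matching.

```r
# Hypothetical codes: one exact match, two that merely contain "A" or "D"
codes <- c("A", "A34", "BAD", "C")

# %in% tests whole-value equality against the set
exact <- codes %in% c("A", "D")

# grepl tests whether the pattern occurs anywhere inside each string
partial <- grepl("A|D", codes)

print(data.frame(codes, exact, partial))
```

With single-letter codes, as in the sample data, both give the same result; they diverge once codes can contain the letters as part of a longer string.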

base R

dat[ave(dat$dx, dat$id, FUN = function(z) !any(z %in% c("A", "D"))) == "TRUE",]
#   id time dx
# 1  1    1  C
# 2  1    2  B

(ave returns a vector of the same class as its input x, which in this case is character, so the logical results of FUN are coerced to strings. That's why I compare against the string "TRUE" instead of the literal TRUE.)
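
The coercion described above can be checked directly. The sketch below, which rebuilds the sample data and uses the illustrative names chr_flags, lgl_flags, and kept, compares passing the character column to ave with passing a logical vector instead; the latter avoids the string comparison.

```r
dat <- data.frame(
  id   = c(1L, 1L, 2L, 2L, 2L, 3L),
  time = c(1L, 2L, 1L, 2L, 3L, 1L),
  dx   = c("C", "B", "A", "C", "B", "D"),
  stringsAsFactors = FALSE
)

# Character input: the logical results are stored back into a character vector
chr_flags <- ave(dat$dx, dat$id, FUN = function(z) !any(z %in% c("A", "D")))
print(class(chr_flags))

# Logical input: the result stays logical and can index rows directly
lgl_flags <- ave(dat$dx %in% c("A", "D"), dat$id, FUN = any)
print(class(lgl_flags))

kept <- dat[!lgl_flags, ]
print(kept)
```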


Data

dat <- structure(list(id = c(1L, 1L, 2L, 2L, 2L, 3L), time = c(1L, 2L, 1L, 2L, 3L, 1L), dx = c("C", "B", "A", "C", "B", "D")), class = "data.frame", row.names = c(NA, -6L))

CodePudding user response:

We may use subset in base R

subset(df1, !id %in% id[dx %in% c("A", "D")])
  id time dx
1  1    1  C
2  1    2  B

Or a similar option with filter from dplyr

library(dplyr)
filter(df1, !id %in% id[dx %in% c("A", "D")])
  id time dx
1  1    1  C
2  1    2  B

data

df1 <- structure(list(id = c(1L, 1L, 2L, 2L, 2L, 3L), time = c(1L, 2L, 
1L, 2L, 3L, 1L), dx = c("C", "B", "A", "C", "B", "D")), 
class = "data.frame", row.names = c(NA, 
-6L))

CodePudding user response:

Another base R option using subset + ave

# df is assumed to hold the question's data frame (same contents as dat / df1 above)
subset(
  df,
  !ave(dx %in% c("A", "D"), id, FUN = any)
)

gives

  id time dx
1  1    1  C
2  1    2  B