Filter based on different conditions at different positions in a string in R

The middle part of each string is the ID, and I want only one row per ID. If more than one observation shares the same six middle letters, I need to keep the one that says "07" rather than "08", or "A" rather than "B". I want to exclude a row completely if the number is "02". Otherwise, if an ID occurs only once, I want to keep it. So if I had:

col1                       
ID-1-AMBCFG-07A-01
ID-1-CGUMBD-08A-01
ID-1-XDUMNG-07B-01
ID-1-XDUMNG-08B-01
ID-1-LOFBUM-02A-01
ID-1-ABYEMJ-08A-01  
ID-1-ABYEMJ-08B-01

Then I would want:

col1
ID-1-AMBCFG-07A-01
ID-1-CGUMBD-08A-01
ID-1-XDUMNG-07B-01
ID-1-ABYEMJ-08A-01

I am thinking maybe I can use group_by to specify the 6 letter ID, and then some kind of if_else statement? But I can't figure out how to specify the positions of the characters in the string. Any help is greatly appreciated!

CodePudding user response:

Using extract and some dplyr wrangling:

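library(tidyr)
library(dplyr)

df1 %>% 
  # Split each string into the 6-letter ID, the number, and the letter;
  # convert = TRUE turns "07" into the integer 7.
  extract(col1, into = c("ID", "number", "letter"),
          regex = "ID-\\d-(.*)-(\\d*)(A|B)-01",
          remove = FALSE, convert = TRUE) %>% 
  # Drop every "02" row.
  filter(number != 2) %>% 
  # Sort so the preferred row (lowest number, then letter) comes first per ID.
  arrange(ID, number, letter) %>% 
  group_by(ID) %>% 
  slice(1) %>%
  ungroup() %>% 
  select(col1)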

#                col1                        
#1 ID-1-ABYEMJ-08A-01
#2 ID-1-AMBCFG-07A-01
#3 ID-1-CGUMBD-08A-01
#4 ID-1-XDUMNG-07B-01
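
To see what the extraction step captures before any filtering, here is a minimal base R sketch. It uses utils::strcapture() as a stand-in for tidyr::extract() (my substitution for illustration, not what the answer uses), with the same regular expression. The names x and parts are hypothetical.

```r
# Sample strings from the question.
x <- c("ID-1-AMBCFG-07A-01", "ID-1-CGUMBD-08A-01",
       "ID-1-XDUMNG-07B-01", "ID-1-XDUMNG-08B-01",
       "ID-1-LOFBUM-02A-01", "ID-1-ABYEMJ-08A-01",
       "ID-1-ABYEMJ-08B-01")

# Each capture group becomes one column; the prototype data frame
# sets the column names and types (number is converted to integer).
parts <- strcapture(
  "ID-\\d-(.*)-(\\d*)(A|B)-01", x,
  proto = data.frame(ID = character(), number = integer(),
                     letter = character(), stringsAsFactors = FALSE)
)

print(parts)
```

Because number is converted to an integer, "02" becomes 2, which is why the filter compares against 2 rather than "02".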

CodePudding user response:

An option with str_detect

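library(stringr)
library(dplyr)

df1 %>% 
  # Group by the "ID-<digits>-<letters>" prefix, e.g. "ID-1-XDUMNG".
  group_by(ID = str_extract(col1, "ID-\\d+-\\w+")) %>% 
  # Drop rows containing "02" and keep the first row of each group; this
  # relies on the preferred row (e.g. 07 before 08) already coming first.
  filter(str_detect(col1, "02", negate = TRUE), row_number() == 1) %>%
  ungroup() %>% 
  select(-ID)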

-output

# A tibble: 4 × 1
  col1              
  <chr>             
1 ID-1-AMBCFG-07A-01
2 ID-1-CGUMBD-08A-01
3 ID-1-XDUMNG-07B-01
4 ID-1-ABYEMJ-08A-01

data

df1 <- structure(list(col1 = c("ID-1-AMBCFG-07A-01", "ID-1-CGUMBD-08A-01", 
    "ID-1-XDUMNG-07B-01", "ID-1-XDUMNG-08B-01", "ID-1-LOFBUM-02A-01", 
    "ID-1-ABYEMJ-08A-01", "ID-1-ABYEMJ-08B-01")),
    class = "data.frame", row.names = c(NA, -7L))
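
Since the question asks how to refer to character positions directly, here is a base R sketch that uses substr(). It assumes every string has the same fixed layout as the sample (ID at characters 6 to 11, number at 13 to 14, letter at 15), which the question does not confirm for the full data. The names parts and result are hypothetical.

```r
df1 <- data.frame(
  col1 = c("ID-1-AMBCFG-07A-01", "ID-1-CGUMBD-08A-01",
           "ID-1-XDUMNG-07B-01", "ID-1-XDUMNG-08B-01",
           "ID-1-LOFBUM-02A-01", "ID-1-ABYEMJ-08A-01",
           "ID-1-ABYEMJ-08B-01"),
  stringsAsFactors = FALSE
)

# Pull each field out by character position (assumes fixed-width strings).
parts <- data.frame(
  col1   = df1$col1,
  id     = substr(df1$col1, 6, 11),
  number = substr(df1$col1, 13, 14),
  letter = substr(df1$col1, 15, 15),
  stringsAsFactors = FALSE
)

# Drop every "02" row, then sort so the preferred row
# ("07" before "08", "A" before "B") comes first within each ID.
parts <- parts[parts$number != "02", ]
parts <- parts[order(parts$id, parts$number, parts$letter), ]

# Keep only the first row of each ID.
result <- parts[!duplicated(parts$id), "col1", drop = FALSE]
print(result)
```

Sorting the two-digit numbers as text works here because they are zero-padded. If the field widths vary, the regex-based approaches above are the safer choice.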