How do I split multiple delimited strings in a column in to a row and delete the duplicates in that-CodePudding

Here's how my data frame looks like...

Fasta headers
ab12_P002;ab12_P003;ab12_P005;ab23_P002;ab23_P001
ab45_P001;ab36_P001
ab55_P001;ab55_P002

I managed to make a delimited string in a column into a row using code below

library(tidyr)
library(dplyr)
without_02473 %>% 
  mutate(`Fasta headers` = strsplit(as.character(`Fasta headers`), 
      ";")) %>%   unnest(`Fasta headers`)

which resulted in following

Fasta headers
ab12_P002
ab12_P003
ab12_P005
ab23_P002
ab23_P001
ab45_P001

However, I would like to have following result in the end.

Fasta headers
ab12
ab23
ab45
ab36

I tried to use group and filter, unnest(string_string_array), but I didn't manage to do it. Can someone help me?

CodePudding user response：

Here is slightly different approach:

library(stringr)
library(dplyr)
library(tidyr)

without_02473 %>% 
  separate_rows(`Fasta headers`) %>% 
  filter(str_detect(`Fasta headers`, 'ab\\d ')) %>% 
  distinct()

  `Fasta headers`
  <chr>          
1 ab12           
2 ab23           
3 ab45           
4 ab36           
5 ab55

CodePudding user response：

Another option using strsplit and remove everything after _ and filter distincts like this:

library(dplyr)
library(tidyr)
without_02473 %>%
  mutate(`Fasta headers` = strsplit(`Fasta headers`, ";")) %>%
  unnest(`Fasta headers`) %>%
  mutate(`Fasta headers` = sub("_[^_] $", "", `Fasta headers`)) %>%
  distinct()
#> # A tibble: 5 × 1
#>   `Fasta headers`
#>   <chr>          
#> 1 ab12           
#> 2 ab23           
#> 3 ab45           
#> 4 ab36           
#> 5 ab55

^{Created on 2023-01-03 with reprex v2.0.2}

CodePudding user response：

We may use separate_rows to split the Fasta headers at the ; to create new rows and then remove the suffix part starting from the _ with trimws

library(dplyr)
library(tidyr)
out <- without_02473 %>% 
   separate_rows(`Fasta headers`, sep = ";") %>%
   mutate(`Fasta headers` = trimws(`Fasta headers`, whitespace = "_.*")) %>%
   distinct

-output

out
# A tibble: 5 × 1
  `Fasta headers`
  <chr>          
1 ab12           
2 ab23           
3 ab45           
4 ab36           
5 ab55   
library(writexl) 
write_xlsx(out, "first.xlsx")

Or may extract only the word before the _ with str_extract_all, unnest the list column and get the distinct rows

library(stringr)
without_02473 %>%
   mutate(`Fasta headers` = str_extract_all(`Fasta headers`, 
        "\\w (?=_)")) %>%
   unnest(`Fasta headers`) %>%
   distinct

data

without_02473 <- structure(list(`Fasta headers` = c("ab12_P002;ab12_P003;ab12_P005;ab23_P002;ab23_P001", 
"ab45_P001;ab36_P001", "ab55_P001;ab55_P002")), class = "data.frame", row.names = c(NA, 
-3L))