Here's how my data frame looks like...
I
Fasta headers |
---|
ab12_P002;ab12_P003;ab12_P005;ab23_P002;ab23_P001 |
ab45_P001;ab36_P001 |
ab55_P001;ab55_P002 |
I managed to make a delimited string in a column into a row using code below
library(tidyr)
library(dplyr)
without_02473 %>%
mutate(`Fasta headers` = strsplit(as.character(`Fasta headers`),
";")) %>% unnest(`Fasta headers`)
which resulted in following
Fasta headers |
---|
ab12_P002 |
ab12_P003 |
ab12_P005 |
ab23_P002 |
ab23_P001 |
ab45_P001 |
However, I would like to have following result in the end.
Fasta headers |
---|
ab12 |
ab23 |
ab45 |
ab36 |
I tried to use group and filter, unnest(string_string_array), but I didn't manage to do it. Can someone help me?
CodePudding user response:
Here is slightly different approach:
library(stringr)
library(dplyr)
library(tidyr)
without_02473 %>%
separate_rows(`Fasta headers`) %>%
filter(str_detect(`Fasta headers`, 'ab\\d ')) %>%
distinct()
`Fasta headers`
<chr>
1 ab12
2 ab23
3 ab45
4 ab36
5 ab55
CodePudding user response:
Another option using strsplit
and remove everything after _ and filter distinct
s like this:
library(dplyr)
library(tidyr)
without_02473 %>%
mutate(`Fasta headers` = strsplit(`Fasta headers`, ";")) %>%
unnest(`Fasta headers`) %>%
mutate(`Fasta headers` = sub("_[^_] $", "", `Fasta headers`)) %>%
distinct()
#> # A tibble: 5 × 1
#> `Fasta headers`
#> <chr>
#> 1 ab12
#> 2 ab23
#> 3 ab45
#> 4 ab36
#> 5 ab55
Created on 2023-01-03 with reprex v2.0.2
CodePudding user response:
We may use separate_rows
to split the Fasta headers
at the ;
to create new rows and then remove the suffix part starting from the _
with trimws
library(dplyr)
library(tidyr)
out <- without_02473 %>%
separate_rows(`Fasta headers`, sep = ";") %>%
mutate(`Fasta headers` = trimws(`Fasta headers`, whitespace = "_.*")) %>%
distinct
-output
out
# A tibble: 5 × 1
`Fasta headers`
<chr>
1 ab12
2 ab23
3 ab45
4 ab36
5 ab55
library(writexl)
write_xlsx(out, "first.xlsx")
Or may extract only the word before the _
with str_extract_all
, unnest
the list
column and get the distinct
rows
library(stringr)
without_02473 %>%
mutate(`Fasta headers` = str_extract_all(`Fasta headers`,
"\\w (?=_)")) %>%
unnest(`Fasta headers`) %>%
distinct
data
without_02473 <- structure(list(`Fasta headers` = c("ab12_P002;ab12_P003;ab12_P005;ab23_P002;ab23_P001",
"ab45_P001;ab36_P001", "ab55_P001;ab55_P002")), class = "data.frame", row.names = c(NA,
-3L))