I have a text file named headers.txt containing several FASTA entries. I need to split the header of each entry into its ID, sequence name, node number, sequence length and sequencing type, with each field in its own column of the same data frame.
The headers are written as follows (headers.txt):
NZ_MCQZ01000071.1:2282-2767 Klebsiella pneumoniae strain TR196 Scaffold45_1, whole genome shotgun sequence
RYOH01000117.1:3-590 Klebsiella pneumoniae strain 16WZ-131 NODE_117_length_2026_cov_233.332478, whole genome shotgun sequence
RYOJ01000145.1:3-857 Klebsiella pneumoniae strain 16WZ-128 NODE_145_length_2293_cov_224.091606, whole genome shotgun sequence
NZ_CABWRH010000049.1:1707-2128 Klebsiella pneumoniae strain SRRSH43 isolate SRRSH43, whole genome shotgun sequence
RYQS01000239.1:1916-2698 Klebsiella pneumoniae strain 16HN-12 NODE_239_length_2763_cov_7.539092, whole genome shotgun sequence
The fields of a header, using the last sequence as an example:
ID: RYQS01000239.1:1916-2698
sequence name: Klebsiella pneumoniae strain 16HN-12
node number: NODE_239
sequence length: length_2763_cov_7.539092
sequencing type: whole genome shotgun sequence
My problem is that these .txt files can contain hundreds of FASTA headers, and not every header contains every field (for example, some state the length and some do not), as the example headers above show. Each column should therefore still be created, but left empty where the data is not found.
I have tried using strsplit, but I can't find a delimiter that works for all of them. This is as far as I have gotten:
library("Biostrings")
fastaFile <- readDNAStringSet("~/ex1/headers.txt")
seq_name = names(fastaFile)
df <- data.frame(seq_name)
library(stringr)
df[c('ID', 'Sequence name','Sequence length','Node number','Sequencing type')] <- str_split_fixed(df$seq_name, ' ', 5)
df <- df[c('ID', 'Sequence name', 'Sequence length','Node number','Sequencing type')]
The data frame should look like this
ID | Sequence name | Node number | Sequence length | Sequencing type |
---|---|---|---|---|
RYQS01000239.1:1916-2698 | Klebsiella pneumoniae strain 16HN-12 | NODE_239 | length_2763_cov_7.539092 | whole genome shotgun sequence |
CodePudding user response:
You could use a kind of look-behind that covers the colon plus the text up to the space. Since (?<=:\w+-\w+) won't work because of its varying length, we may use \K, which resets the start of the match at this point. The other regexes are straightforward.
readLines('headers.txt') |>
{\(.) .[. != '']}() |>
strsplit(':\\w+-\\w+\\K\\s|\\sNODE_|_length_|_cov_|,\\s', perl=TRUE) |>
{\(.) lapply(., `length<-`, max(lengths(.)))}() |>
lapply(\(x) {g <- grepl('genome', x); if (any(g)) {v <- x[g]; x[g] <- NA; x[length(x)] <- v}; x}) |>
do.call(what='rbind.data.frame') |>
type.convert(as.is=TRUE) |>
setNames(c('ID', 'name', 'node', 'length', 'cov', 'type'))
#                               ID                                                 name node length        cov                          type
# 1    NZ_MCQZ01000071.1:2282-2767      Klebsiella pneumoniae strain TR196 Scaffold45_1   NA     NA         NA whole genome shotgun sequence
# 2           RYOH01000117.1:3-590                Klebsiella pneumoniae strain 16WZ-131  117   2026 233.332478 whole genome shotgun sequence
# 3           RYOJ01000145.1:3-857                Klebsiella pneumoniae strain 16WZ-128  145   2293 224.091606 whole genome shotgun sequence
# 4 NZ_CABWRH010000049.1:1707-2128 Klebsiella pneumoniae strain SRRSH43 isolate SRRSH43   NA     NA         NA whole genome shotgun sequence
# 5       RYQS01000239.1:1916-2698                 Klebsiella pneumoniae strain 16HN-12  239   2763   7.539092 whole genome shotgun sequence
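To see the effect of \K in isolation, here is a small sketch that splits a single header from the question. The variable names h, parts and naive are only illustrative. It compares the \K-based delimiter with a plain split on every space, which is roughly what the str_split_fixed attempt in the question does.

```r
# One header taken from the question, used purely for illustration.
h <- "RYQS01000239.1:1916-2698 Klebsiella pneumoniae strain 16HN-12 NODE_239_length_2763_cov_7.539092, whole genome shotgun sequence"

# \K drops everything matched so far from the reported match, so only the
# space after the ID counts as the delimiter and the ID itself is kept.
parts <- strsplit(h, ":\\w+-\\w+\\K\\s", perl = TRUE)[[1]]

# Splitting on every space instead also breaks the sequence name apart.
naive <- strsplit(h, " ", fixed = TRUE)[[1]]

print(parts[1])       # the ID, still intact
print(length(parts))  # ID plus the rest of the header
print(length(naive))  # many more pieces than there are fields
```

Because the ID is the only place where a colon is followed by a hyphenated range and then a space, this alternative of the pattern should fire once per header, assuming all headers follow the layout shown in the question.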
Note: R >= 4.1 is required, since the code uses the native pipe |> and the \(x) lambda shorthand.
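If you would rather keep the node and length fields in the form shown in the question (NODE_239 and length_2763_cov_7.539092), another option is to extract each field with its own pattern, so that a missing field simply becomes NA. The following is only a sketch in base R: extract_first is a hypothetical helper, not part of any package, and the patterns assume every header follows the layout of the examples above.

```r
# Two headers from the question: one without and one with NODE/length fields.
headers <- c(
  "NZ_MCQZ01000071.1:2282-2767 Klebsiella pneumoniae strain TR196 Scaffold45_1, whole genome shotgun sequence",
  "RYQS01000239.1:1916-2698 Klebsiella pneumoniae strain 16HN-12 NODE_239_length_2763_cov_7.539092, whole genome shotgun sequence"
)

# Hypothetical helper: first match of `pattern` in each string, NA if none.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- as.integer(m) != -1L
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)  # regmatches() returns matched elements only
  out
}

df <- data.frame(
  ID     = extract_first(headers, "^\\S+"),
  # Name: text after the first space, up to " NODE_" or the comma.
  name   = extract_first(headers, "(?<=\\s).*?(?=\\sNODE_|,)"),
  node   = extract_first(headers, "NODE_\\d+"),
  length = extract_first(headers, "length_\\d+_cov_[0-9.]+"),
  # Type: everything after the comma and space.
  type   = extract_first(headers, "(?<=, ).*$"),
  stringsAsFactors = FALSE
)

print(df)
```

Each column is filled independently, so adding or dropping a field only touches one line, at the cost of scanning every header once per field.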