How to split a string if there is no common delimiter

I have a text file named headers.txt containing several FASTA headers. I have to split each header into the ID, sequence name, node number, sequence length and sequencing type. Each field should end up in its own column of the same data frame.

They are written out as follows (headers.txt)

NZ_MCQZ01000071.1:2282-2767 Klebsiella pneumoniae strain TR196 Scaffold45_1, whole genome shotgun sequence

RYOH01000117.1:3-590 Klebsiella pneumoniae strain 16WZ-131 NODE_117_length_2026_cov_233.332478, whole genome shotgun sequence

RYOJ01000145.1:3-857 Klebsiella pneumoniae strain 16WZ-128 NODE_145_length_2293_cov_224.091606, whole genome shotgun sequence

NZ_CABWRH010000049.1:1707-2128 Klebsiella pneumoniae strain SRRSH43 isolate SRRSH43, whole genome shotgun sequence

RYQS01000239.1:1916-2698 Klebsiella pneumoniae strain 16HN-12 NODE_239_length_2763_cov_7.539092, whole genome shotgun sequence

Expected fields, based on the last header:

ID: RYQS01000239.1:1916-2698 

sequence name: Klebsiella pneumoniae strain 16HN-12

node number: NODE_239

sequence length: length_2763_cov_7.539092

sequencing type: whole genome shotgun sequence

My problem is that these .txt files can contain hundreds of FASTA headers like these, and not every header contains every field (for example, some state the length and some do not, as in the example headers above). The column should still be created, but left empty when the data is not found.

I have tried using strsplit, but I can't find a delimiter that works for all of them. This is as far as I have gotten:

library("Biostrings")

fastaFile <- readDNAStringSet("~/ex1/headers.txt")
seq_name = names(fastaFile)
df <- data.frame(seq_name)

library(stringr)
df[c('ID', 'Sequence name','Sequence length','Node number','Sequencing type')] <- str_split_fixed(df$seq_name, ' ', 5)

df <- df[c('ID', 'Sequence name', 'Sequence length','Node number','Sequencing type')]

The data frame should look like this

ID	Sequence name	Node number	Sequence length	Sequencing type
RYQS01000239.1:1916-2698	Klebsiella pneumoniae strain 16HN-12	NODE_239	length_2763_cov_7.539092	whole genome shotgun sequence

CodePudding user response:

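You could use a kind of look-behind that anchors on the colon plus the text up to the space. Since (?<=:\w+-\w+) won't work, because PCRE look-behinds can't contain unbounded quantifiers such as +, we may use \K instead, which resets the start of the match at that point so that only the following whitespace is consumed as the delimiter. The other alternatives in the regex are straightforward.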
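To see what \K does in isolation, here is a minimal sketch on a single header taken from the question. It only marks the split point after the ID with a "|" (an arbitrary marker chosen for illustration, assuming the headers never contain that character) and is not the full solution.

```r
# One header from the question, used only for illustration.
x <- "RYQS01000239.1:1916-2698 Klebsiella pneumoniae strain 16HN-12 NODE_239_length_2763_cov_7.539092, whole genome shotgun sequence"

# \K discards everything matched so far (":1916-2698"), so only the
# whitespace after the ID counts as the match and gets replaced.
marked <- sub(":\\w+-\\w+\\K\\s", "|", x, perl = TRUE)

# Splitting at the marker separates the ID from the rest of the header.
parts <- strsplit(marked, "|", fixed = TRUE)[[1]]
print(parts[1])
```

The complete pipeline uses the same trick directly inside strsplit, together with the other delimiters.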

readLines('headers.txt') |>
  # drop the blank lines between headers
  {\(.) .[. != '']}() |>
  # split after the ID (via \K) and at NODE_, _length_, _cov_ and the comma
  strsplit(':\\w+-\\w+\\K\\s|\\sNODE_|_length_|_cov_|,\\s', perl=TRUE) |>
  # pad every vector with NA up to the longest one
  {\(.) lapply(., `length<-`, max(lengths(.)))}() |>
  # move the 'genome' element into the last slot so it lands in the type column
  lapply(\(x) {g <- grepl('genome', x); if (any(g)) {v <- x[g]; x[g] <- NA; x[length(x)] <- v}; x}) |>
  do.call(what='rbind.data.frame') |>
  type.convert(as.is=TRUE) |>
  setNames(c('ID', 'name', 'node', 'length', 'cov', 'type'))
#                               ID                                                 name node length        cov                          type
# 1    NZ_MCQZ01000071.1:2282-2767      Klebsiella pneumoniae strain TR196 Scaffold45_1   NA     NA         NA whole genome shotgun sequence
# 2           RYOH01000117.1:3-590                Klebsiella pneumoniae strain 16WZ-131  117   2026 233.332478 whole genome shotgun sequence
# 3           RYOJ01000145.1:3-857                Klebsiella pneumoniae strain 16WZ-128  145   2293 224.091606 whole genome shotgun sequence
# 4 NZ_CABWRH010000049.1:1707-2128 Klebsiella pneumoniae strain SRRSH43 isolate SRRSH43   NA     NA         NA whole genome shotgun sequence
# 5       RYQS01000239.1:1916-2698                 Klebsiella pneumoniae strain 16HN-12  239   2763   7.539092 whole genome shotgun sequence


Note: R >= 4.1 is required for the native pipe |> and the \(x) lambda shorthand.
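If you would rather keep the fields in the form shown in the question (NODE_239, length_2763_cov_7.539092), one alternative is to extract each field with its own pattern, so a missing field simply becomes NA. This is only a sketch: extract_first is a hypothetical helper written for this example, not a function from any package, and the patterns are assumptions based on the five example headers.

```r
# Two of the example headers: one without and one with NODE/length information.
headers <- c(
  "NZ_MCQZ01000071.1:2282-2767 Klebsiella pneumoniae strain TR196 Scaffold45_1, whole genome shotgun sequence",
  "RYQS01000239.1:1916-2698 Klebsiella pneumoniae strain 16HN-12 NODE_239_length_2763_cov_7.539092, whole genome shotgun sequence"
)

# Hypothetical helper: first match of `pattern` in each string, NA if none.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- as.integer(m) > 0
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)  # regmatches() returns matched elements only
  out
}

df <- data.frame(
  ID     = extract_first(headers, "^\\S+"),
  # text after the first whitespace, up to " NODE_" or ", " (assumed layout)
  name   = extract_first(headers, "(?<=\\s).*?(?=\\sNODE_|,\\s)"),
  node   = extract_first(headers, "NODE_\\d+"),
  length = extract_first(headers, "length_\\d+_cov_[0-9.]+"),
  # everything after the comma (assumes a single ", " per header)
  type   = extract_first(headers, "(?<=, ).*$"),
  stringsAsFactors = FALSE
)
print(df)
```

Because every column is computed independently, a header that lacks a field never shifts the remaining values into the wrong column.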