I am parsing long strings with semicolons and quotes using R v4.0.0 and stringi. Here is an example string:
tstr1 <- 'gene_id "APE_RS08740"; transcript_id "unassigned_transcript_1756"; gbkey "CDS"; inference "COORDINATES: protein motif:HMM:NF014037.1"; locus_tag "APE_RS08740"; note "incomplete; partial in the middle of a contig; missing N-terminus"; partial "true"; product "DUF5615 family PIN-like protein"; pseudo "true"; transl_table "11"; exon_number "1"'
I would like to extract a quoted substring by first matching a variable pattern var and then extracting everything until the next semicolon. I would like to avoid matching instances of var that are within quoted substrings. So far, I have this:
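library(stringr)  # str_extract() comes from stringr
library(dplyr)    # provides %>%
var <- "partial"
# Match '"; partial' and everything up to the next semicolon
str_extract(string = tstr1, pattern = paste0('"; ', var, '[^;]+')) %>%
  gsub(paste0("\"; ", var), "", .) %>%  # drop the leading '"; partial'
  gsub("\"", "", .) %>% trimws()        # drop the quotes, then trim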
This returns "true", which is my desired output. However, I need a regex that also works in two edge cases:
Case 1
When var is at the beginning of the string and I can't rely on a preceding "; to match.
tstr2 <- 'partial "true"; gene_id "APE_RS08740"; transcript_id "unassigned_transcript_1756"; gbkey "CDS"; infernce "COORDINATES: protein motif:HMM:NF014037.1"; locus_tag "APE_RS08740"; note "incomplete; partial in the middle of a contig; missing N-terminus"; product "DUF5615 family PIN-like protein"; pseudo "true"; transl_table "11"; exon_number "1"'
Expected output: "true"
Case 2
When the quoted substring to be extracted contains a semicolon, I would want to match everything until the next semicolon that is not within the quoted substring.
tstr3 <- 'partial "true; foo"; gene_id "APE_RS08740"; transcript_id "unassigned_transcript_1756"; gbkey "CDS"; infernce "COORDINATES: protein motif:HMM:NF014037.1"; locus_tag "APE_RS08740"; note "incomplete; partial in the middle of a contig; missing N-terminus"; product "DUF5615 family PIN-like protein"; pseudo "true"; transl_table "11"; exon_number "1"'
Expected output: "true; foo"
Answer:
We may use an OR (|) condition to cover the case where 'partial' is not preceded by a " or ;, and then extract the characters between the two double quotes.
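library(stringr)
# First alternative: var preceded by '";'; second alternative: var at the start of the string
str_extract(tstr, sprintf('";\\s+%1$s[^;]+|^%1$s[^;]+;[^"]+"', var)) %>%
  trimws(whitespace = '["; ]+', which = 'left') %>%  # strip the leading '"; '
  str_extract('(?<=")[^"]+(?=")')                    # keep the text between the first pair of quotes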
-output
[1] "true" "true" "true; foo"
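As a possible variation on the same idea, the anchor and the quoted value can be handled in one pattern with a capture group. The sketch below uses only base R; `extract_attr` is a hypothetical helper name, the strings `s1` to `s3` are shortened stand-ins for the strings in the question, and it assumes the key contains no regex metacharacters and that values never contain a double quote.

```r
# Hypothetical helper (not part of the answer above): return the quoted value
# that follows `key`, or NA when the key is not found.
extract_attr <- function(x, key) {
  # (?:^|;\s*)  the key must start the string or follow a semicolon
  # \s+"        then whitespace and the opening quote
  # ([^"]*)     capture everything up to the closing quote (semicolons allowed)
  pattern <- sprintf('(?:^|;\\s*)%s\\s+"([^"]*)"', key)
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Element 1 is the full match, element 2 the captured value
  vapply(hits, function(h) h[2], character(1))
}

# Shortened stand-ins for tstr1, tstr2 and tstr3
s1 <- 'gene_id "APE_RS08740"; note "incomplete; partial in the middle of a contig"; partial "true"; pseudo "true"'
s2 <- 'partial "true"; gene_id "APE_RS08740"; note "incomplete; partial in the middle of a contig"'
s3 <- 'partial "true; foo"; gene_id "APE_RS08740"'

extract_attr(c(s1, s2, s3), "partial")
# expected: "true" "true" "true; foo"
```

The occurrence of partial inside the note value is skipped because it is not followed by an opening quote. A quoted value that itself ends in something like `; partial ` would still fool this pattern, so treat it as a sketch rather than a full parser.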
data
tstr <- c(tstr1, tstr2, tstr3)
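If several attributes are needed from the same string, another option is to split the whole string into key/value pairs once and then look values up by name. This is only a sketch in base R: `parse_attrs` is a hypothetical helper, the string `s` is a shortened stand-in for tstr3, and it assumes every pair has the form key "value" with no double quotes inside the value.

```r
# Hypothetical helper: turn one attribute string into a named character vector.
parse_attrs <- function(x) {
  # A pair is a key, whitespace, then a double-quoted value. Matching the whole
  # quoted value at once means semicolons inside it are never treated as separators.
  pairs <- regmatches(x, gregexpr('\\w+\\s+"[^"]*"', x, perl = TRUE))[[1]]
  keys <- sub('\\s+".*$', "", pairs, perl = TRUE)             # text before the opening quote
  vals <- sub('^[^"]*"([^"]*)"$', "\\1", pairs, perl = TRUE)  # text between the quotes
  names(vals) <- keys
  vals
}

# Shortened stand-in for tstr3
s <- 'partial "true; foo"; gene_id "APE_RS08740"; note "incomplete; partial in the middle"'
attrs <- parse_attrs(s)
attrs[["partial"]]
# expected: "true; foo"
```

Because the matches are consumed left to right, the word partial inside the note value is never seen as a key here.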