Optimizing a regex in R for substring extraction

I have a follow-up question on a previous answer that can be found here: Split uneven string in R - variable substring and delimiters

In summary, I wanted to extract the bolded text in a string that follows this pattern:

sp|Q2UVX4|CO3_BOVIN **Complement C3** OS=Bos taurus OX=9913 GN=**C3** PE=1 SV=2

Here is a piece of the answer provided by Martin Gal:

protein_name = ifelse(str_detect(string, ".*_BOVIN\\s(.*?)\\sOS=.*"), 
                      str_replace(string, ".*_BOVIN\\s(.*?)\\sOS=.*", "\\1"),
                      NA_character_),

His answer was excellent, but sometimes I have a mix of species (e.g., BOVIN and HUMAN), so I wanted to make the code a bit more flexible. I tried using only a space (\\s) and a capital letter followed by a space ([A-Z]\\s), but the first failed and the second was inaccurate for some strings. Then I combined Martin's approach with a string ending in a capital letter, aiming to select the entire first chunk as the delimiter (e.g., sp|Q2UVX4|CO3_BOVIN).

I changed the code to this:

protein_name = ifelse(str_detect(string, "[a-z]{2}\\|(.*?)[A-Z]\\s(.*?)\\sOS=.*"), 
                      str_replace(string, "[a-z]{2}\\|(.*?)[A-Z]\\s(.*?)\\sOS=.*", "\\2"),
                      NA_character_),

In this case, what would be the best way to select everything between the two patterns? The two patterns are "sp" and a capital letter followed by one space.
I used (.*?). Is this the best approach?

CodePudding user response：

This can be solved as follows:

str_extract_all(string, "(?<=(?:BOVIN|HUMAN) )(.*?)(?= OS).*?GN=(\\w+)") %>%
   map_df(~read.table(text=str_replace(.,"OS.*GN", ""), sep="=",
             col.names = c('protein_name', 'gene')), .id='grp')
   grp                                                                protein_name   gene
1    1                                                              Complement C3      C3
2    1                                                                  C3-beta-c      C3
3    1                                                                  C3-beta-c      C3
4    2                                                                Haptoglobin      HP
5    2                                                                Haptoglobin      HP
6    2                                                                Haptoglobin      HP
7    3                                                     Anion exchange protein  SLC4A7
8    4                                        Isoform V3 of Versican core protein    VCAN
9    4                                        Isoform V2 of Versican core protein    VCAN
10   4                                                      Versican core protein    VCAN
11   5 Keratin 10 (Epidermolytic hyperkeratosis; keratosis palmaris et plantaris)   KRT10
12   5                                            Keratin, type I cytoskeletal 10   KRT10

You could also use the following. Note that as_tibble is not necessary; I used it only to get prettier output.

unlist(strsplit(string, "\\w{2}=\\w+\\K;", perl = TRUE))%>%
   sub(".*?(?:BOVIN|HUMAN) (.*?)(?= OS).*?GN=(\\w+).*|.*",  "\\1:\\2", ., perl = TRUE) %>%
   read.table(text=., sep=":") %>%
   as_tibble()

# A tibble: 14 x 2
   V1                                                                           V2      
   <chr>                                                                        <chr>   
 1 "Complement C3"                                                              "C3"    
 2 "C3-beta-c"                                                                  "C3"    
 3 "C3-beta-c"                                                                  "C3"    
 4 ""                                                                           ""      
 5 "Haptoglobin"                                                                "HP"    
 6 "Haptoglobin"                                                                "HP"    
 7 "Haptoglobin"                                                                "HP"    
 8 ""                                                                           ""      
 9 "Anion exchange protein"                                                     "SLC4A7"
10 "Isoform V3 of Versican core protein"                                        "VCAN"  
11 "Isoform V2 of Versican core protein"                                        "VCAN"  
12 "Versican core protein"                                                      "VCAN"  
13 "Keratin 10 (Epidermolytic hyperkeratosis; keratosis palmaris et plantaris)" "KRT10" 
14 "Keratin, type I cytoskeletal 10"                                            "KRT10"
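The same two-group capture idea can also be sketched in base R without the tidyverse. This is only an illustration under assumptions: the two input strings are hypothetical single-record headers (the snippets above were run on the asker's data, which is not reproduced here), and `pick` is a helper name made up for this sketch.

```r
# Hypothetical sample input: one UniProt-style header per element
string <- c(
  "sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2",
  "sp|P00738|HPT_HUMAN Haptoglobin OS=Homo sapiens OX=9606 GN=HP PE=1 SV=1"
)

# Group 1: text between the species tag and " OS="; group 2: word chars after "GN="
pattern <- "(?:BOVIN|HUMAN) (.*?) OS=.*?GN=(\\w+)"

# regexec() returns the full match followed by one entry per capture group
m <- regmatches(string, regexec(pattern, string, perl = TRUE))

# Hypothetical helper: i-th element of each match, or NA when a string did not match
pick <- function(i) {
  vapply(m, function(x) if (length(x) >= i) x[i] else NA_character_, character(1))
}

result <- data.frame(
  protein_name = pick(2),
  gene = pick(3),
  stringsAsFactors = FALSE
)
print(result)
```

Supporting another species here means extending the (?:BOVIN|HUMAN) alternation, which is the hard-coding the question wanted to avoid.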

CodePudding user response：

Your "best" pattern is always the one that meets all your requirements. So, always start from defining the requirements: the match should start with..., the following chars can appear here, there... and the match should end with...

So, in your case, it seems you can discard the intermediate checks and just use

library(stringr)
str_match(string, '[a-z]{2}\\|[|\\w]*[A-Z]\\s+(.*?)\\s+OS=')[,2]

Since stringr::str_match keeps all captures, it helps immensely when you have to match a pattern inside a complex context. [,2] accesses the contents of Group 1.

The regex matches:

[a-z]{2} - two lowercase ASCII letters (here, there is no problem with performance, when you tell the regex to match a single char repeated X times, this is very efficient)
\| - a | char (again, this is fine, a literal is matched efficiently)
[|\w]* - zero or more | or word chars (this is backtracking prone since the next pattern matches an uppercase letter, which is also a word char, but here, we need this backtracking)
[A-Z] - an uppercase ASCII letter
\s+ - one or more whitespace chars
(.*?) - Group 1: zero or more chars other than line break chars as few as possible (this is the most resource consuming pattern here, as it will be expanded char after char if the subsequent patterns fail to match; also, it does not match line breaks by default, if you have line breaks, you need ((?s:.*?)))
\s+ - one or more whitespace chars
OS= - an OS= substring.

R demo:

string <- 'sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2'
library(stringr)
str_match(string, '[a-z]{2}\\|[|\\w]*[A-Z]\\s+(.*?)\\s+OS=')[,2]

Output:

# => [1] "Complement C3"
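Since the question is about mixed species, here is a small sketch that applies the same pattern to a vector holding a BOVIN header, a HUMAN header, and a string that should not match. It uses base R (regexec with perl = TRUE) rather than stringr so it runs without extra packages. The HUMAN header and the non-matching string are made-up test inputs, and `extract_protein` is a hypothetical helper name.

```r
# Hypothetical helper: contents of Group 1, or NA where the pattern does not match
extract_protein <- function(x) {
  pattern <- "[a-z]{2}\\|[|\\w]*[A-Z]\\s+(.*?)\\s+OS="
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(m, function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1))
}

# Assumed inputs: two species plus one string that is not a header at all
headers <- c(
  "sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2",
  "sp|P00738|HPT_HUMAN Haptoglobin OS=Homo sapiens OX=9606 GN=HP PE=1 SV=1",
  "not a UniProt header"
)

protein_name <- extract_protein(headers)
print(protein_name)
```

No species name appears in the pattern, so both headers are handled by the same expression, and the non-matching element comes back as NA, much like the ifelse/NA_character_ construct in the question.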

If you need to optimize the .*? pattern, read up on the unroll-the-loop technique and learn to apply it. TL;DR:

[a-z]{2}\|[|\w]*[A-Z]\s+(\S*(?:\s(?!\s*OS=)\S*)*)\s+OS=

The .*? is transformed into \S*(?:\s(?!\s*OS=)\S*)* (note how the subsequent pattern is "sewn into" this construct), which matches:

\S* - zero or more non-whitespace chars
(?:\s(?!\s*OS=)\S*)* - zero or more sequences of any whitespace that is not immediately followed with zero or more whitespaces and OS=, and then again zero or more non-whitespace chars.
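
As a quick sanity check, the lazy and the unrolled patterns can be run side by side to confirm they capture the same text. The sketch below uses base R with perl = TRUE (the lookahead needs a PCRE-style engine). The sample headers are illustrative assumptions and `first_group` is a hypothetical helper name. It only checks agreement on these samples; it does not measure speed.

```r
lazy     <- "[a-z]{2}\\|[|\\w]*[A-Z]\\s+(.*?)\\s+OS="
unrolled <- "[a-z]{2}\\|[|\\w]*[A-Z]\\s+(\\S*(?:\\s(?!\\s*OS=)\\S*)*)\\s+OS="

# Hypothetical helper: contents of Group 1 for each input, NA if there is no match
first_group <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(m, function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1))
}

# Assumed sample headers, including a name that contains several spaces
samples <- c(
  "sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2",
  "sp|P13645|K1C10_HUMAN Keratin, type I cytoskeletal 10 OS=Homo sapiens OX=9606 GN=KRT10 PE=1 SV=6"
)

lazy_out     <- first_group(lazy, samples)
unrolled_out <- first_group(unrolled, samples)

print(unrolled_out)
print(identical(lazy_out, unrolled_out))
```

Whether the unrolled form is actually faster depends on the regex engine and the input, so it is worth timing both on real data before switching.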