I have a follow-up question on a previous answer that can be found here: Split uneven string in R - variable substring and delimiters
In summary, I wanted to extract the bolded text in a string that follows this pattern:
sp|Q2UVX4|CO3_BOVIN **Complement C3** OS=Bos taurus OX=9913 GN=**C3** PE=1 SV=2
Here is a piece of the answer provided by Martin Gal:
protein_name = ifelse(str_detect(string, ".*_BOVIN\\s(.*?)\\sOS=.*"),
str_replace(string, ".*_BOVIN\\s(.*?)\\sOS=.*", "\\1"),
NA_character_),
His answer was excellent, but sometimes I have a mix of species (e.g.: BOVIN and HUMAN), so I wanted to make the code a bit more flexible. I tried with only a space (\\s) and with a capital letter followed by a space ([A-Z]\\s), but the first failed and the second was inaccurate for some strings. Then I combined Martin's approach with a pattern ending in a capital letter, aiming to select the entire first chunk as the delimiter (e.g.: sp|Q2UVX4|CO3_BOVIN).
This is what I changed it to:
protein_name = ifelse(str_detect(string, "[a-z]{2}\\|(.*?)[A-Z]\\s(.*?)\\sOS=.*"),
str_replace(string, "[a-z]{2}\\|(.*?)[A-Z]\\s(.*?)\\sOS=.*", "\\2"),
NA_character_),
- In this case, what would be the best way to select everything in between the two patterns? The two patterns are "sp" and a capital letter followed by one space.
- I used (.*?); is this the best approach?
CodePudding user response:
This can be solved as follows:
library(tidyverse)  # stringr, purrr and the %>% pipe
str_extract_all(string, "(?<=(?:BOVIN|HUMAN) )(.*?)(?= OS).*?GN=(\\w+)") %>%
map_df(~read.table(text=str_replace(.,"OS.*GN", ""), sep="=",
col.names = c('protein_name', 'gene')), .id='grp')
   grp  protein_name                                                                gene
1  1    Complement C3                                                               C3
2  1    C3-beta-c                                                                   C3
3  1    C3-beta-c                                                                   C3
4  2    Haptoglobin                                                                 HP
5  2    Haptoglobin                                                                 HP
6  2    Haptoglobin                                                                 HP
7  3    Anion exchange protein                                                      SLC4A7
8  4    Isoform V3 of Versican core protein                                         VCAN
9  4    Isoform V2 of Versican core protein                                         VCAN
10 4    Versican core protein                                                       VCAN
11 5    Keratin 10 (Epidermolytic hyperkeratosis; keratosis palmaris et plantaris)  KRT10
12 5    Keratin, type I cytoskeletal 10                                             KRT10
You could also use the following. Note that as_tibble() is not necessary; it is used here only to print the result nicely.
unlist(strsplit(string, "\\w{2}=\\w+\\K;", perl = TRUE)) %>%
sub(".*?(?:BOVIN|HUMAN) (.*?)(?= OS).*?GN=(\\w+).*|.*", "\\1:\\2", ., perl = TRUE) %>%
read.table(text=., sep=":") %>%
as_tibble()
# A tibble: 14 x 2
   V1                                                                           V2
   <chr>                                                                        <chr>
 1 "Complement C3"                                                              "C3"
 2 "C3-beta-c"                                                                  "C3"
 3 "C3-beta-c"                                                                  "C3"
 4 ""                                                                           ""
 5 "Haptoglobin"                                                                "HP"
 6 "Haptoglobin"                                                                "HP"
 7 "Haptoglobin"                                                                "HP"
 8 ""                                                                           ""
 9 "Anion exchange protein"                                                     "SLC4A7"
10 "Isoform V3 of Versican core protein"                                        "VCAN"
11 "Isoform V2 of Versican core protein"                                        "VCAN"
12 "Versican core protein"                                                      "VCAN"
13 "Keratin 10 (Epidermolytic hyperkeratosis; keratosis palmaris et plantaris)" "KRT10"
14 "Keratin, type I cytoskeletal 10"                                            "KRT10"
CodePudding user response:
Your "best" pattern is always the one that meets all your requirements. So, always start from defining the requirements: the match should start with..., the following chars can appear here, there... and the match should end with...
So, in your case, it seems you can discard the intermediate checks and just use
library(stringr)
str_match(string, '[a-z]{2}\\|[|\\w]*[A-Z]\\s+(.*?)\\s+OS=')[,2]
As stringr::str_match keeps all captures, it helps immensely when you have to match some pattern inside a complex context. The [,2] index accesses the contents of Group 1.
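To make the role of the [,2] index concrete, here is a minimal sketch of the matrix that str_match returns. The first call uses the pattern from this answer; the second call, which adds a GN= capture to also pull the gene name, is an assumption about how the pattern could be extended and is not part of the original answer.

```r
library(stringr)

string <- "sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2"

# Pattern from the answer: column 1 is the whole match, column 2 is Group 1
m <- str_match(string, "[a-z]{2}\\|[|\\w]*[A-Z]\\s+(.*?)\\s+OS=")
m[, 1]  # whole match: "sp|Q2UVX4|CO3_BOVIN Complement C3 OS="
m[, 2]  # Group 1:     "Complement C3"

# Assumed extension (not in the original answer): a second group for the gene
m2 <- str_match(string, "[a-z]{2}\\|[|\\w]*[A-Z]\\s+(.*?)\\s+OS=.*?GN=(\\w+)")
m2[, 2]  # protein name: "Complement C3"
m2[, 3]  # gene:         "C3"
```

Each capture group adds one column, so a pattern with two groups is read with [,2] and [,3]. The breakdown that follows refers to the single-group pattern.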
The regex matches:
- [a-z]{2} - two lowercase ASCII letters (here, there is no problem with performance: when you tell the regex to match a single char repeated X times, this is very efficient)
- \| - a | char (again, this is fine, a literal is matched efficiently)
- [|\w]* - zero or more | or word chars (this is backtracking prone since the next pattern matches an uppercase letter, which is also a word char, but here, we need this backtracking)
- [A-Z] - an uppercase ASCII letter
- \s+ - one or more whitespace chars
- (.*?) - Group 1: zero or more chars other than line break chars, as few as possible (this is the most resource consuming pattern here, as it will be expanded char after char if the subsequent patterns fail to match; also, it does not match line breaks by default, so if you have line breaks, you need ((?s:.*?)))
- \s+ - one or more whitespace chars
- OS= - an OS= substring.
See the regex demo. See the R demo:
string <- 'sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2'
library(stringr)
str_match(string, '[a-z]{2}\\|[|\\w]*[A-Z]\\s+(.*?)\\s+OS=')[,2]
Output:
# => [1] "Complement C3"
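Since the question is about mixed species, here is a small sketch of the same pattern applied to a vector. The HUMAN entry and the malformed entry are hypothetical inputs written in the same format as the question's example; they are not taken from the asker's data.

```r
library(stringr)

strings <- c(
  "sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2",
  # hypothetical HUMAN entry in the same format
  "sp|P00738|HPT_HUMAN Haptoglobin OS=Homo sapiens OX=9606 GN=HP PE=1 SV=1",
  # hypothetical malformed entry: no "xx|" prefix and no " OS=" field
  "entry without the expected fields"
)

pattern <- "[a-z]{2}\\|[|\\w]*[A-Z]\\s+(.*?)\\s+OS="

# str_match is vectorised and returns NA where the pattern does not match
protein_name <- str_match(strings, pattern)[, 2]
protein_name
```

Because non-matching inputs already come back as NA, the ifelse(str_detect(...), str_replace(...), NA_character_) wrapper from the question should not be needed with this approach.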
If you need to optimize the .*? pattern, you need to read more about, and learn to use, the unroll-the-loop approach. Tl;dr:
[a-z]{2}\|[|\w]*[A-Z]\s+(\S*(?:\s(?!\s*OS=)\S*)*)\s+OS=
See this regex demo.
The .*? is transformed into \S*(?:\s(?!\s*OS=)\S*)* (see how the subsequent pattern is "sewn into" this construct), which matches
- \S* - zero or more non-whitespace chars
- (?:\s(?!\s*OS=)\S*)* - zero or more sequences of any whitespace that is not immediately followed with zero or more whitespaces and OS=, and then again zero or more non-whitespace chars.
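As a quick check, the unrolled pattern can be written as an R string (with backslashes doubled) and compared against the lazy version. This is only a sketch: the two HUMAN entries below are hypothetical inputs in the question's format, and the comparison says nothing about speed, only that both patterns capture the same text on these inputs.

```r
library(stringr)

strings <- c(
  "sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2",
  # hypothetical entries in the same format
  "sp|P00738|HPT_HUMAN Haptoglobin OS=Homo sapiens OX=9606 GN=HP PE=1 SV=1",
  "sp|X00000|K1C10_HUMAN Keratin, type I cytoskeletal 10 OS=Homo sapiens OX=9606 GN=KRT10 PE=1 SV=1"
)

# Lazy version from the answer
lazy <- "[a-z]{2}\\|[|\\w]*[A-Z]\\s+(.*?)\\s+OS="
# Unrolled version: same regex as above, escaped for an R string literal
unrolled <- "[a-z]{2}\\|[|\\w]*[A-Z]\\s+(\\S*(?:\\s(?!\\s*OS=)\\S*)*)\\s+OS="

lazy_res     <- str_match(strings, lazy)[, 2]
unrolled_res <- str_match(strings, unrolled)[, 2]

# Both patterns should capture the same protein names
identical(lazy_res, unrolled_res)
```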