I'm trying to use stringr/dplyr to extract a pathway name from a table cell containing excess information. All cells in this table follow the same general format. Some examples are:
(R)-lactate from methylglyoxal: step 1/2. {ECO:0000256|ARBA:ARBA00005008, ECO:0000256|RuleBase:RU361179}.
(S)-dihydroorotate from bicarbonate: step 3/3. {ECO:0000256|ARBA:ARBA00004880}.
3,4',5-trihydroxystilbene biosynthesis
From these examples, I want to extract "(R)-lactate from methylglyoxal", "(S)-dihydroorotate from bicarbonate", and "3,4',5-trihydroxystilbene biosynthesis", respectively. I'm struggling to figure out which regular expression to use to accomplish this. I've been trying to use the positive lookbehind assertion (?<=...) along with str_extract to extract everything preceding the first ":", but I can't get it to work. Any help would be appreciated!
CodePudding user response:
Please try the following pattern:
(?<=^)(.+?)(:|$)
(?<=^) The first part anchors the match at the beginning of the string.
(.+?)(:|$) The second part lazily matches at least one character, up to the first ":" or the end of the string.
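A minimal sketch of how this pattern might be applied, assuming stringr is loaded. Because the full match includes the trailing ":", the sketch uses str_match() and keeps only the first capture group instead of calling str_extract(). The names s_example and pathway are hypothetical and not part of the original answer.

```r
library(stringr)

# Hypothetical sample input, copied from the question
s_example <- c(
  "(S)-dihydroorotate from bicarbonate: step 3/3. {ECO:0000256|ARBA:ARBA00004880}.",
  "3,4',5-trihydroxystilbene biosynthesis"
)

# str_match() returns a matrix: column 1 is the full match,
# column 2 is the first capture group, here (.+?)
m <- str_match(s_example, "(?<=^)(.+?)(:|$)")
pathway <- m[, 2]
pathway
```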
CodePudding user response:
You don't need any lookarounds, you can match the values using:
^[^\r\n:]+
The pattern matches:
^ Start of string
[^\r\n:]+ Match 1+ chars other than newlines or ":"
library(stringr)
s <- c("(R)-lactate from methylglyoxal: step 1/2. {ECO:0000256|ARBA:ARBA00005008, ECO:0000256|RuleBase:RU361179}.",
"(S)-dihydroorotate from bicarbonate: step 3/3. {ECO:0000256|ARBA:ARBA00004880}.",
"3,4',5-trihydroxystilbene biosynthesis")
str_extract(s, "^[^\\r\\n:]+")
Output
[1] "(R)-lactate from methylglyoxal"
[2] "(S)-dihydroorotate from bicarbonate"
[3] "3,4',5-trihydroxystilbene biosynthesis"
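Since the question mentions dplyr, here is a rough sketch of how the same pattern could be applied to a table column with mutate(). The data frame df and the column names pathway_raw and pathway are assumptions made for illustration; substitute the real ones.

```r
library(dplyr)
library(stringr)

# Hypothetical table; `pathway_raw` stands in for the real column name
df <- data.frame(
  pathway_raw = c(
    "(R)-lactate from methylglyoxal: step 1/2. {ECO:0000256|ARBA:ARBA00005008, ECO:0000256|RuleBase:RU361179}.",
    "(S)-dihydroorotate from bicarbonate: step 3/3. {ECO:0000256|ARBA:ARBA00004880}.",
    "3,4',5-trihydroxystilbene biosynthesis"
  ),
  stringsAsFactors = FALSE
)

# Keep everything from the start of each cell up to (not including) the first ":"
df <- df %>%
  mutate(pathway = str_extract(pathway_raw, "^[^\\r\\n:]+"))

df$pathway
```

Cells without a ":" are returned whole, because the character class simply runs to the end of the string.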