Extracting a specific identifier from a column containing excess information

I'm trying to use stringr/dplyr to extract a pathway name from a table cell containing excess information. All cells in this table follow the same general format. Some examples are:

(R)-lactate from methylglyoxal: step 1/2. {ECO:0000256|ARBA:ARBA00005008, ECO:0000256|RuleBase:RU361179}.

(S)-dihydroorotate from bicarbonate: step 3/3. {ECO:0000256|ARBA:ARBA00004880}.

3,4',5-trihydroxystilbene biosynthesis

From these examples, I want to extract "(R)-lactate from methylglyoxal", "(S)-dihydroorotate from bicarbonate", and "3,4',5-trihydroxystilbene biosynthesis" respectively. I'm struggling to figure out which combination of regular expressions will accomplish this. I've been trying to use a positive lookbehind assertion, (?<=...), along with str_extract to extract everything preceding the first ":", but I can't get it to work. Any help would be appreciated!

CodePudding user response：

Please try the following pattern:

(?<=^)(.+?)(:|$)

(?<=^) the first part anchors the match at the beginning of the string.
(.+?)(:|$) the second part lazily matches at least one character, up to the first ":" or the end of the string.
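Note that the full match of this pattern includes the trailing ":", so the pathway name sits in the first capture group rather than in the match itself. A minimal sketch of pulling that group out with stringr's str_match follows. It swaps the lookbehind (?<=^) for a plain ^, which asserts the same position; the variable names s and pathway are illustrative only.

```r
library(stringr)

s <- c("(R)-lactate from methylglyoxal: step 1/2. {ECO:0000256|ARBA:ARBA00005008, ECO:0000256|RuleBase:RU361179}.",
       "3,4',5-trihydroxystilbene biosynthesis")

# str_match() returns a character matrix: column 1 is the full match
# (which may end in ":"), column 2 is the first capture group.
pathway <- str_match(s, "^(.+?)(:|$)")[, 2]
pathway
```

Using str_extract with the same pattern would return the name with the ":" still attached for the rows that contain one.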


CodePudding user response：

You don't need any lookarounds; you can match the values using:

^[^\r\n:]+

The pattern matches:

^ Start of string
[^\r\n:]+ Match one or more characters other than newlines or :


library(stringr)

s <- c("(R)-lactate from methylglyoxal: step 1/2. {ECO:0000256|ARBA:ARBA00005008, ECO:0000256|RuleBase:RU361179}.",
"(S)-dihydroorotate from bicarbonate: step 3/3. {ECO:0000256|ARBA:ARBA00004880}.",
"3,4',5-trihydroxystilbene biosynthesis")
str_extract(s, "^[^\\r\\n:]+")

Output

[1] "(R)-lactate from methylglyoxal"        
[2] "(S)-dihydroorotate from bicarbonate"   
[3] "3,4',5-trihydroxystilbene biosynthesis"
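Since the question mentions dplyr and a table column, the same pattern can be applied inside mutate. This is only a sketch: the data frame df and the column names pathway_raw and pathway are hypothetical stand-ins for the real table.

```r
library(dplyr)
library(stringr)

# Hypothetical table; pathway_raw stands in for the real column name.
df <- data.frame(
  pathway_raw = c("(S)-dihydroorotate from bicarbonate: step 3/3. {ECO:0000256|ARBA:ARBA00004880}.",
                  "3,4',5-trihydroxystilbene biosynthesis"),
  stringsAsFactors = FALSE
)

# Keep everything from the start of the cell up to (not including) the first ":".
df <- df %>%
  mutate(pathway = str_extract(pathway_raw, "^[^\\r\\n:]+"))

df$pathway
```

Cells without a ":" are returned whole, and NA cells stay NA.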