Regex lookaround function with irrelevant text in the middle

Time:03-24

My text should contain tip followed by top. Additionally, if tap is between tip and top (in that order, i.e. tip...tap...top), then no other top can be between tip and tap (i.e. tip...top...tap...top is forbidden).

Some examples, with the expected result:

1. "tip tip top tip tip" TRUE
2. "top tip tup tip tap top" TRUE
3. "tip top tap tap top" FALSE
4. "tip tup top tap tap top" FALSE
5. "tip top tap tap tip" TRUE

I have tried using lookarounds, e.g.

condition = "(tip.*top) & (tip(?!.*top).*tap.*top)"
str_detect(mytext, condition)

but it doesn't work.

Here is a reproducible example:

library(stringr)

mytext = c("tip tip top tip tip" , "top tip tup tip tap top" ,
           "tip top tap tap top" , "tip tup top tap tap top" , "tip top tap tap tip" )
condition = "(tip.*top) & (tip(?!.*top).*tap.*top)"
str_detect(mytext, condition)

which gives

[1] FALSE FALSE FALSE FALSE FALSE

rather than the expected TRUE TRUE FALSE FALSE TRUE.
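
A likely reason for the all-FALSE result: inside a regular expression, " & " is not a logical AND but three literal characters (space, ampersand, space) that must appear in the text. The sketch below checks this on the sample strings. It uses base R grepl() rather than str_detect() so that no package is needed, and has_literal_amp and attempt are names introduced here for illustration only.

```r
mytext <- c("tip tip top tip tip", "top tip tup tip tap top",
            "tip top tap tap top", "tip tup top tap tap top",
            "tip top tap tap tip")

# The pattern from the question: it requires the literal text " & "
# between the two parenthesised groups.
condition <- "(tip.*top) & (tip(?!.*top).*tap.*top)"

# fixed = TRUE searches for the plain substring " & ", not a regex.
has_literal_amp <- grepl(" & ", mytext, fixed = TRUE)
print(has_literal_amp)

# perl = TRUE enables lookarounds such as (?!...) in base R.
attempt <- grepl(condition, mytext, perl = TRUE)
print(attempt)
```

Separately, the second group tip(?!.*top).*tap.*top cannot match on its own: the lookahead demands that no top follows tip, while the rest of the pattern demands one.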

CodePudding user response:

What if we do this:

library(stringr)

mytext = c("tip tip top tip tip" , "top tip tup tip tap top" ,
           "tip top tap tap top" , "tip tup top tap tap top" , "tip top tap tap tip" )
# must contain tip...top, and must not contain the forbidden tip...top...tap...top
str_detect(mytext, "tip.*top") & !str_detect(mytext, "tip.*top.*tap.*top")

[1]  TRUE  TRUE FALSE FALSE  TRUE
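
To see which of the two rules decides each string, the two checks can be kept as separate logical vectors before combining them. This is only a sketch of the same idea in base R (grepl() instead of str_detect(), so nothing needs installing); the names has_tip_then_top, has_forbidden and result are made up for this illustration.

```r
mytext <- c("tip tip top tip tip", "top tip tup tip tap top",
            "tip top tap tap top", "tip tup top tap tap top",
            "tip top tap tap tip")

# Rule 1: a tip is followed somewhere later by a top.
has_tip_then_top <- grepl("tip.*top", mytext)

# Rule 2 (forbidden shape): tip ... top ... tap ... top.
has_forbidden <- grepl("tip.*top.*tap.*top", mytext)

# A string passes when rule 1 holds and the forbidden shape is absent.
result <- has_tip_then_top & !has_forbidden

print(data.frame(mytext, has_tip_then_top, has_forbidden, result))
```

On these five strings every input satisfies the first rule, so the outcome is decided entirely by whether the forbidden shape is present.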

CodePudding user response:

@KevinDialdestoro gave the solution I would use, but if you really want it all in one regexp, here's his solution translated into regex language:

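str_detect(mytext, "^(?=.*tip.*top)(?!.*tip.*top.*tap.*top)")

The (?=...) part is a positive lookahead: it checks that its pattern matches ahead without consuming any characters. The (?!...) part is a negative lookahead: it succeeds only if its pattern does not match. The leading ^ pins both checks to the start of the string, so the engine cannot retry them from a later position where an earlier tip is no longer visible.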

EDITED TO ADD: My first posting got it wrong. I think it's fixed now, but that's why Kevin's solution is better: it's obviously correct.
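
One way to gain confidence in a single-regex version is to compare it against the two-call version on more inputs. The sketch below does that in base R (grepl() with perl = TRUE, so no package is required). The sixth string in inputs is a hypothetical extra case added here, not one from the question, and all variable names are illustrative. It also compares the lookahead pair with and without a leading ^, since an unanchored pattern made only of lookaheads may be retried from later positions in the string.

```r
mytext <- c("tip tip top tip tip", "top tip tup tip tap top",
            "tip top tap tap top", "tip tup top tap tap top",
            "tip top tap tap tip")

# Hypothetical extra input: a forbidden tip...top...tap...top,
# followed by a second, harmless tip...top.
inputs <- c(mytext, "tip top tap top tip top")

# Reference: the two-call version.
two_calls <- grepl("tip.*top", inputs) & !grepl("tip.*top.*tap.*top", inputs)

# Single regex, anchored at the start of the string.
anchored <- grepl("^(?=.*tip.*top)(?!.*tip.*top.*tap.*top)",
                  inputs, perl = TRUE)

# Same lookaheads without the anchor.
unanchored <- grepl("(?=.*tip.*top)(?!.*tip.*top.*tap.*top)",
                    inputs, perl = TRUE)

print(data.frame(inputs, two_calls, anchored, unanchored))
```

On the five original strings all three columns agree. On the extra string the unanchored pattern can still match starting just after the first character, where only the trailing tip...top is visible, so it disagrees with the two-call version, while the anchored one does not.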
