Home > Back-end >  Regex: Alternative to .* expression to define word sequence
Regex: Alternative to .* expression to define word sequence

Time:06-25

i am currently trying to match specific sentences based on the words they contain and their order. I am doing this mostly with the lookahead assertion based on this structure:

[^>.\]]*(?="The desired Words)[^<.\]]*

So i am for example looking for sentences that talk about Vacation in the Maledives. To match the sentence: i´ll book a vacation to the Maledives. I could look for sentences that contain the word vacation and afterwards maledives.

The expression [^>.\]]*(?=([Vv]cation.*[Mm]aledives))[^<.\]]* using .* causes problems because the word "Maledives" can also appear in later sentences (Example Wrong)

My solutionw as to use the expression ([,'`´() \\]*\w ){0,X}\s* instead of .*, to indicate that "Maledives" has to follow "vacation" within the same sentences and with a maximum of X words between them, changing to this structure:

[^>.\]]*(?=([Vv]cation([,'`´() \\]*\w ){0,X}\s*[Mm]aledives))[^<.\]]* (Example Correct)

Unfortunately this expression is quite computationally intensive and leads to catastrophic backtracking if the range {0,X} is set to high.

Do you have any other suggestions how to look for sentences containing specific words in order?

CodePudding user response:

You can try something like this.

(?:^|\.)\s*([^.]*[Vv]acation[^.]*[Mm]aledives[^.]*(?:\.|$))

See demo.

https://regex101.com/r/97UvCH/1

There is no need for lookahead .It will slow things down.

CodePudding user response:

Your pattern is prone to catastrophic backtracking as there are nested quantifiers in the {0,3} repeating part and there are also leading optional quantifiers at the start of the pattern.

Python re does not support possessive quantifiers or atomic groups, but you can mimic that using a capture group in a lookahead assertion, and then use the backreference to that group when the assertion is true for the first part to reduce the backtracking a bit.

But the second part with the quantifier {0,200} should not be atomic because you want to allow backtracking to fit a variable number of words before matching maledives.

So the higher the number for the quantifier will be, the more possible paths are there to explore.

(?<!\S)(?=([^<>.\]]*[Vv]acation\b))\1(?:[,'`´() \\] \w ){0,200}\s*[Mm]aledives\b[^<.\]]*

The pattern matches:

  • (?<!\S) Assert a whitespace boundary to the left
  • (?= Positive lookahead assertion, assert what is to the right is
    • ( Capture group 1
      • [^<>.\]]*[Vv]acation\b Match optional chars other than the listed and then match vacation followed by a word boundary
    • ) Close group 1
  • ) Close the lookahead
  • \1 Match a backreference to group 1 (that is matched in the lookahead)
  • (?:[,'`´() \\] \w ){0,200} Repeat 0-n times one or more chars from the character class and then 1 word characters
  • \s*[Mm]aledives Match optional whitespace chars and then maledives
  • [^<.\]]* Optionally match any character other than the listed in the character class

See a regex demo.


Another option could be using the PyPi regex module with an atomic group (?> for the first part:

(?<!\S)(?>[^<>.\]]*[Vv]acation\b)(?:[,'`´() \\] \w ){0,3}\s*[Mm]aledives[^<.\]]*

See a Python demo

  • Related