Regex: Is there a way to not consume words that are captured?

Time:03-10

I am trying to extract the 3 words before and after a given word using a regex in Python. It works for most cases, but it fails when the given word occurs twice within the 3-word region, as in the snippet below (the given word is "hello").

import re

new_text = "I am going to say hello and hello to him"
re.findall(r"((?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,3})(hello)((?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,3})", new_text)

Expected Output:

[('going to say ', 'hello', ' and hello to'), ('say hello and ', 'hello', ' to him')]

Actual Output:

[('going to say ', 'hello', ' and hello to')]

From my research, this happens because the regex consumes the text it matches, so the second "hello" is never processed as a match of its own. I need to capture the surrounding region because I will be doing additional processing on it.
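
The consumption can be seen directly by checking where the first match ends. This is a minimal sketch using the pattern and sample sentence from the question; the variable names are only illustrative.

```python
import re

new_text = "I am going to say hello and hello to him"
pattern = re.compile(
    r"((?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,3})(hello)((?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,3})"
)

first = pattern.search(new_text)
# Start index of the second "hello", searching after the one matched by group 2.
second_hello = new_text.index("hello", first.end(2))

# The first match spans from "going" through "to", so it already covers the
# second "hello". findall() resumes after the end of that match and only sees
# " him", which is why a single tuple comes back.
print(first.span(), second_hello)
print(pattern.findall(new_text))
```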

Any advice on how to proceed will be greatly appreciated (Regex or non-regex).

Thanks!

CodePudding user response:

You can match the following regular expression:

r'(?=\b((?:\w+ +){2}\w+) +(hello) +((?:\w+ +){0,2}\w+\b))'


There are three capture groups. They contain:

  • a string of three words preceding 'hello'
  • the word 'hello'
  • a string of one to three words following 'hello', as many as possible.

There are two matches in the string:

"I am going to say hello and hello to him" 

1st match (zero-width, before 'g' in 'going')

  • Capture group 1: "going to say"
  • Capture group 2: "hello"
  • Capture group 3: "and hello to"

2nd match (zero-width, before 's' in 'say')

  • Capture group 1: "say hello and"
  • Capture group 2: "hello"
  • Capture group 3: "to him"

Note that saving 'hello' to a capture group is really unnecessary because there will not be a match if 'hello' is not present in its required position. Observe also that I have constructed the regular expression in such a way that all capture groups begin and end with a word character (rather than with a space as shown in the question).
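
The two matches listed above can be checked in Python with a short script. This is a sketch that assumes the sample sentence from the question; `finditer` is used so that the zero-width match positions are visible alongside the capture groups.

```python
import re

text = "I am going to say hello and hello to him"
pattern = re.compile(r"(?=\b((?:\w+ +){2}\w+) +(hello) +((?:\w+ +){0,2}\w+\b))")

for m in pattern.finditer(text):
    # m.group(0) is the empty string because the whole pattern is a lookahead;
    # only the capture groups carry text.
    print(m.start(), repr(m.group(0)), m.groups())

# findall() returns just the tuples of capture groups.
results = pattern.findall(text)
print(results)
```

Because the lookahead consumes nothing, the scan advances one character at a time, so the second 'hello' can anchor its own match even though it also appears in capture group 3 of the first match.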


The regular expression can be broken down as follows. (Note that I show a space as a character class containing one space, merely to make the space character visible to the reader.)

(?=              # begin positive lookahead
  \b             # match word boundary
  (              # begin capture group 1
    (?:\w+[ ]+)  # match >= 1 word chars followed by >= 1 spaces
    {2}          # execute preceding non-capture group 2 times
    \w+          # match >= 1 word chars
  )              # end capture group 1
  [ ]+           # match >= 1 spaces
  (hello)        # match literal and save to capture group 2
  [ ]+           # match >= 1 spaces
  (              # begin capture group 3
    (?:\w+[ ]+)  # match >= 1 word chars followed by >= 1 spaces
    {0,2}        # execute preceding non-capture group 0-2 times
    \w+          # match >= 1 word chars
    \b           # match a word boundary
  )              # end capture group 3
)                # end positive lookahead
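
The same breakdown can be written as a commented pattern with Python's re.VERBOSE flag and parameterised on the target word and window sizes. This is only a sketch: the helper name `build_context_pattern` and its `before`/`after` parameters are hypothetical, and, like the expression above, it requires exactly `before` words ahead of the target word.

```python
import re

def build_context_pattern(word, before=3, after=3):
    """Hypothetical helper: exactly `before` words ahead of `word`,
    and between 1 and `after` words behind it."""
    return re.compile(
        rf"""
        (?=                                     # lookahead: consumes nothing
          \b                                    # start at a word boundary
          ((?:\w+[ ]+){{{before - 1}}}\w+)      # group 1: words before
          [ ]+
          ({re.escape(word)})                   # group 2: the target word
          [ ]+
          ((?:\w+[ ]+){{0,{after - 1}}}\w+\b)   # group 3: words after
        )
        """,
        re.VERBOSE,
    )

text = "I am going to say hello and hello to him"
print(build_context_pattern("hello").findall(text))
print(build_context_pattern("hello", before=2, after=1).findall(text))
```

The spaces are written as `[ ]` because re.VERBOSE ignores bare whitespace in the pattern, and re.escape keeps the target word literal if it contains regex metacharacters.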