Extract text between strings using regex

Time:07-09

I'm trying to extract, from the text below, each value that follows the word number together with the sentence that comes after it.

Text:
The conditions are: number 1, the patient is allergic to dust, number next, the patient has bronchitis, number 4, The patient heart rate is high.

From this text I want to extract the following values:

  • 1, the patient is allergic to dust,
  • next, the patient has bronchitis,
  • 4, The patient heart rate is high

I have a pattern that allows me to get the value next to number and the first word of the sentence:

(numbers? (\d+|next)[,.]?\s?(\w+))

This is the result using re.findall

[('number 1, the', '1', 'the'),
 ('number next, the', 'next', 'the'),
 ('number 4, The', '4', 'The')]

As you can see, using groups I can extract the digit or the word next from the text, but I have not been able to extract the entire sentence.
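For reference, here is a minimal reproduction of the attempt, assuming the quantifiers in the pattern are \d+ and \w+ (the variable names are only illustrative). The last group stops after one word because \w+ cannot match a space:

```python
import re

text = (
    "The conditions are: number 1, the patient is allergic to dust, "
    "number next, the patient has bronchitis, "
    "number 4, The patient heart rate is high."
)

# Group 1 is the whole match, group 2 the value, group 3 a single word.
attempt = r"(numbers? (\d+|next)[,.]?\s?(\w+))"

matches = re.findall(attempt, text)
print(matches)
```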

CodePudding user response:

Because the comma, the period, and the whitespace after the digits or next are all optional, you can capture the sentence with a non-greedy dot, followed by a lookahead that asserts either another number(s) to the right or the end of the string (allowing an optional final period).

\bnumbers? (\d+|next)[,.]?\s?(\w.*?)(?= numbers?\b|\.?$)


import re

# Group 1: digits or "next"; group 2: the sentence, matched lazily up to
# the next " number(s)" or the end of the string.
pattern = r"\bnumbers? (\d+|next)[,.]?\s?(\w.*?)(?= numbers?\b|\.?$)"

s = "The conditions are: number 1, the patient is allergic to dust, number next, the patient has bronchitis, number 4, The patient heart rate is high."

print(re.findall(pattern, s))

Output

[
    ('1', 'the patient is allergic to dust,'),
    ('next', 'the patient has bronchitis,'),
    ('4', 'The patient heart rate is high')
]
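To make the pieces of that pattern easier to follow, here is the same regex sketched with re.VERBOSE and one comment per part. The names PATTERN and result are only illustrative. In verbose mode unescaped spaces are ignored, so the literal spaces are written as [ ]:

```python
import re

PATTERN = re.compile(
    r"""
    \bnumbers?[ ]        # the keyword "number" or "numbers", then one space
    (\d+|next)           # group 1: digits or the literal word next
    [,.]?\s?             # optional comma or period, optional whitespace
    (\w.*?)              # group 2: the sentence, matched lazily
    (?=[ ]numbers?\b     # stop right before the next " number(s)" ...
      |\.?$)             # ... or before an optional final period at the end
    """,
    re.VERBOSE,
)

s = (
    "The conditions are: number 1, the patient is allergic to dust, "
    "number next, the patient has bronchitis, "
    "number 4, The patient heart rate is high."
)

result = PATTERN.findall(s)
print(result)
```

Because the lookahead only asserts what follows and does not consume it, the trailing comma of each sentence stays inside group 2, while the final period is left out.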

CodePudding user response:

Try (regex101):

import re

s = "The conditions are: number 1, the patient is allergic to dust, number next, the patient has bronchitis, number 4, The patient heart rate is high."

# Group 2 takes everything up to (not including) the next comma or period.
pat = re.compile(r"numbers? (\d+|next)[,.]?\s?([^,.]+)")

print(pat.findall(s))

Prints:

[
    ("1", "the patient is allergic to dust"),
    ("next", "the patient has bronchitis"),
    ("4", "The patient heart rate is high"),
]
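One difference between the two answers is worth checking on your own data: the negated character class stops at the first comma or period, so a sentence that contains a comma is cut short, whereas the lookahead version keeps going until the next number. A small sketch with a made-up input (the sample string and variable names are hypothetical):

```python
import re

# Hypothetical input in which one sentence contains commas of its own.
sample = "number 2, the patient, a smoker, has asthma, number 3, stable."

lookahead = re.compile(r"\bnumbers? (\d+|next)[,.]?\s?(\w.*?)(?= numbers?\b|\.?$)")
negated = re.compile(r"numbers? (\d+|next)[,.]?\s?([^,.]+)")

by_lookahead = lookahead.findall(sample)
by_negated = negated.findall(sample)

print(by_lookahead)  # the first sentence is kept whole, with its trailing comma
print(by_negated)    # the first sentence is truncated at its first comma
```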