Python extract substring via regex with marker as delimiter

Time:12-14

I have a text file like this:

1. Notice 
Some text 
End Notice
2. Blabla 
Some other text 
Even more text
3. Notice 
Some more text
End Notice

I would like to extract the "2. Blabla" line and the lines that follow it with a regex.

A section such as "2. Blabla" might appear in the text file several times (as with "1. Notice" etc.).

I tried

pattern = r"(\d+\. Blabla[\S\n\t\v ]*?\d+\. )"
re.compile(pattern)
result = re.findall(pattern, text) 
print(result)

but it gives me

['2. Blabla \nSome other text \nEven more text\n3. ']

How can I get rid of the "3. "?

CodePudding user response:

You can use

(?ms)^\d+\. Blabla.*?(?=^\d+\. |\Z)

It matches the start of a line, one or more digits, a dot, a space, Blabla, and then zero or more chars, as few as possible, up to the first occurrence of one or more digits, a dot and a space at the start of a line, or up to the end of the whole string.
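
As a minimal sketch, the lookahead version can be run against the sample text from the question. The variable name lookahead_result is only illustrative. Note that the lazy match stops right before the next numbered heading, so the returned block keeps the newline that precedes it; call .rstrip() on the match if that is unwanted.

```python
import re

text = (
    "1. Notice \nSome text \nEnd Notice\n"
    "2. Blabla \nSome other text \nEven more text\n"
    "3. Notice \nSome more text\nEnd Notice"
)

# (?m): ^ matches at the start of every line.
# (?s): . also matches newlines, so .*? can span several lines.
# (?=...) is a lookahead: the next heading is checked but not consumed.
pattern = r"(?ms)^\d+\. Blabla.*?(?=^\d+\. |\Z)"

lookahead_result = re.findall(pattern, text)
print(lookahead_result)
# => ['2. Blabla \nSome other text \nEven more text\n']
```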

However, there is a faster expression:

(?m)^\d+\. Blabla.*(?:\n(?!\d+\.).*)*

Details:

  • ^ - start of a line (due to the re.M option in the Python code)
  • \d+ - one or more digits
  • \. - a dot
  •  Blabla - a space and then a fixed string
  • .* - the rest of the line
  • (?:\n(?!\d+\.).*)* - zero or more lines that do not start with one or more digits followed by a . char.

See the Python demo:

import re
text = "1. Notice \nSome text \nEnd Notice\n2. Blabla \nSome other text \nEven more text\n3. Notice \nSome more text\nEnd Notice"
pattern = r"^\d+\. Blabla.*(?:\n(?!\d+\.).*)*"
result = re.findall(pattern, text, re.M) 
print(result)
# => ['2. Blabla \nSome other text \nEven more text']
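
Since the question mentions that a section like "2. Blabla" may occur several times, here is a hedged sketch of how the same pattern could be wrapped in a small helper. The function name extract_sections, the title parameter and the extra "4. Blabla" block in the sample are assumptions made for illustration, not part of the original question.

```python
import re

def extract_sections(text, title):
    """Return every numbered section whose heading starts with `title`.

    Hypothetical helper: builds the line-by-line pattern shown above.
    """
    pattern = re.compile(
        # re.escape keeps any regex metacharacters in the title literal.
        rf"^\d+\. {re.escape(title)}.*(?:\n(?!\d+\.).*)*",
        re.M,
    )
    # rstrip() drops trailing whitespace at the end of each block.
    return [m.group().rstrip() for m in pattern.finditer(text)]

# Sample text with an extra, made-up "4. Blabla" section.
sample = (
    "1. Notice \nSome text \nEnd Notice\n"
    "2. Blabla \nSome other text \nEven more text\n"
    "3. Notice \nSome more text\nEnd Notice\n"
    "4. Blabla \nSecond block"
)

sections = extract_sections(sample, "Blabla")
print(sections)
# => ['2. Blabla \nSome other text \nEven more text', '4. Blabla \nSecond block']
```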