I have a text file like this:
1. Notice
Some text
End Notice
2. Blabla
Some other text
Even more text
3. Notice
Some more text
End Notice
I would like to extract "2. Blabla" and the lines of text that follow it with a regex.
A section like "2. Blabla" might appear in the text file several times (as "1. Notice" does, etc.).
I tried
pattern = r"(\d+\. Blabla[\S\n\t\v ]*?\d+\. )"
re.compile(pattern)
result = re.findall(pattern, text)
print(result)
but it gives me
['2. Blabla\nSome other text\nEven more text\n3. ']
How can I get rid of the "3. "?
Answer:
You can use
(?ms)^\d+\. Blabla.*?(?=^\d+\. |\Z)
It will match the start of a line, one or more digits, a dot, a space, Blabla, and then zero or more chars, as few as possible, up to the first occurrence of one or more digits + a dot + a space at the start of a line, or the end of the whole string.
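As a rough sketch of how this first pattern could be run from Python, assuming the sample text from the question (the names `text`, `lazy_pattern` and `lazy_result` are only illustrative):

```python
import re

text = (
    "1. Notice\nSome text\nEnd Notice\n"
    "2. Blabla\nSome other text\nEven more text\n"
    "3. Notice\nSome more text\nEnd Notice"
)

# (?m) lets ^ match at the start of every line, (?s) lets . match newlines.
# The lookahead stops the lazy .*? right before the next numbered heading
# (or at the end of the string) without consuming that heading.
lazy_pattern = r"(?ms)^\d+\. Blabla.*?(?=^\d+\. |\Z)"

lazy_result = re.findall(lazy_pattern, text)
print(lazy_result)
# => ['2. Blabla\nSome other text\nEven more text\n']
```

Note that the match keeps the newline just before the next heading, so you may want to call `.rstrip("\n")` on each item.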
However, there is a faster expression:
(?m)^\d+\. Blabla.*(?:\n(?!\d+\.).*)*
Details:
^ - start of a line (due to the re.M option in the Python code)
\d+ - one or more digits
\. - a dot
 Blabla - a space and then a fixed string
.* - the rest of the line
(?:\n(?!\d+\.).*)* - any zero or more lines that do not start with one or more digits and then a . char.
See the Python demo:
import re
text = "1. Notice \nSome text \nEnd Notice\n2. Blabla \nSome other text \nEven more text\n3. Notice \nSome more text\nEnd Notice"
pattern = r"^\d+\. Blabla.*(?:\n(?!\d+\.).*)*"
result = re.findall(pattern, text, re.M)
print(result)
# => ['2. Blabla \nSome other text \nEven more text']
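Since the question mentions that a "Blabla" section may occur several times, here is a small sketch that wraps the same pattern in a helper. The function name `extract_sections` and the sample text are assumptions made for illustration, not part of the original code:

```python
import re

def extract_sections(text, title):
    """Return every numbered section whose heading starts with `title`."""
    # re.escape guards against regex metacharacters in the title.
    pattern = rf"^\d+\. {re.escape(title)}.*(?:\n(?!\d+\.).*)*"
    return re.findall(pattern, text, re.M)

sample = (
    "1. Notice\nSome text\nEnd Notice\n"
    "2. Blabla\nFirst body\n"
    "3. Notice\nMore text\nEnd Notice\n"
    "4. Blabla\nSecond body\nLast line"
)

sections = extract_sections(sample, "Blabla")
print(sections)
# => ['2. Blabla\nFirst body', '4. Blabla\nSecond body\nLast line']
```

Because the heading is followed by `.*`, a heading such as "5. Blablabla" would match too; add a stricter boundary after the title if that matters for your data.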