I have the following text (this is closely related to another question, but not the same):
text = '7\n\x0c\n7.\tA B C\n\n7.1\tbla bla bla .\n\n7.2\tanother bla bla \n\n7.3\tand another one.\n\n8.\tX Y Z\n\n8.1\tha ha ha \n\n(a)\thohoho ;\n\n(b)\thihihi,\n\n8'
I want to select only section 7, so that I get:
7. A B C
7.1 bla bla bla .
7.2 another bla bla
7.3 and another one.
So I do:
print(re.findall(r'^\d+\.\s*A B C[^\S\n]*(?:\n\n.+)*', text, re.M)[0])
which gives:
7. A B C
7.1 bla bla bla .
7.2 another bla bla
7.3 and another one.
8. X Y Z
8.1 ha ha ha
(a) hohoho ;
(b) hihihi,
8
As you can see, 8 comes after 8.1, and this seems to confuse the regex. What can I do in this case?
Note that the section number can differ in general, so I cannot do something like
re.findall(r'^7\..*', text, re.MULTILINE)
(namely, A B C can be placed in other sections).
CodePudding user response:
You can use
^(\d+\.)\s*A B C(?:\s*\n\1\b.*)*
^ : start of string (with re.MULTILINE, the start of a line)
(\d+\.) : capture group 1, match 1+ digits and a dot
\s*A B C : match optional whitespace chars, then A B C
(?: : non-capture group to match as a whole
    \s*\n : match optional whitespace chars and a newline
    \1\b : a backreference to group 1 followed by a word boundary
    .* : match the rest of the line
)* : close the non-capture group and optionally repeat it
import re
text = '7\n\x0c\n7.\tA B C\n\n7.1\tbla bla bla .\n\n7.2\tanother bla bla \n\n7.3\tand another one.\n\n8.\tX Y Z\n\n8.1\tha ha ha \n\n(a)\thohoho ;\n\n(b)\thihihi,\n\n8'
pattern = r"^(\d+\.)\s*A B C(?:\s*\n\1\b.*)*"
m = re.search(pattern, text, re.MULTILINE)
if m:
    print(m.group())
Output
7. A B C
7.1 bla bla bla .
7.2 another bla bla
7.3 and another one.
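Since the heading text can sit under any section number, the same backreference idea could be wrapped in a small function. This is only a sketch: extract_section is a hypothetical helper name, and it assumes every line of the section starts with the same "<number>." prefix as the heading.

```python
import re

def extract_section(text, title):
    """Return the numbered section whose heading is `title`, or None."""
    # Hypothetical helper: the title is escaped so it is matched literally.
    # Group 1 captures the section prefix (e.g. "7."); each following line
    # must start with that same prefix to be included.
    pattern = r"^(\d+\.)\s*" + re.escape(title) + r"(?:\s*\n\1\b.*)*"
    m = re.search(pattern, text, re.MULTILINE)
    return m.group() if m else None

text = '7\n\x0c\n7.\tA B C\n\n7.1\tbla bla bla .\n\n7.2\tanother bla bla \n\n7.3\tand another one.\n\n8.\tX Y Z\n\n8.1\tha ha ha \n\n(a)\thohoho ;\n\n(b)\thihihi,\n\n8'

print(extract_section(text, "A B C"))
print(extract_section(text, "X Y Z"))
```

Note that for X Y Z the match stops after the 8.1 line, because the (a) and (b) lines do not start with 8. and therefore fail the backreference.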
CodePudding user response:
Here is a potential solution to find the lines between 7. and the first 8.:
m = re.search(r'^7\.\s*(?:.|\n)+?(?=\n8\.)', text, re.MULTILINE)
print(m.group() if m else None)
output:
7. A B C
7.1 bla bla bla .
7.2 another bla bla
7.3 and another one.
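The lookahead approach might also be written without hard-coding 7 and 8. The sketch below is one possible variant, not the answer's own code: section_until_next_heading is a hypothetical name, and it assumes a top-level heading is a line that starts with digits, a dot, and then whitespace (so 7.1 does not count as one).

```python
import re

def section_until_next_heading(text, title):
    """Return text from the heading `title` up to the next top-level heading."""
    # Assumption: top-level headings look like "<digits>." followed by
    # whitespace. The lazy .*? (with DOTALL) stops just before the next such
    # heading, or at the end of the string if there is none.
    pattern = (r"^\d+\.\s*" + re.escape(title)
               + r".*?(?=\s*\n\d+\.\s|\s*\Z)")
    m = re.search(pattern, text, re.MULTILINE | re.DOTALL)
    return m.group() if m else None

text = '7\n\x0c\n7.\tA B C\n\n7.1\tbla bla bla .\n\n7.2\tanother bla bla \n\n7.3\tand another one.\n\n8.\tX Y Z\n\n8.1\tha ha ha \n\n(a)\thohoho ;\n\n(b)\thihihi,\n\n8'

print(section_until_next_heading(text, "A B C"))
```

Unlike the backreference pattern, this keeps sub-items such as (a) and (b), but anything after the last heading, including the stray trailing 8 in the sample, ends up in the last section.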