Suppose I have the following document:
document1 = '1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC\n\n2. ABC \n2.1 hello ABC\n2.2 bla bla bla\n\n3. XYZ\n3.1 bla bla\n3.2 more bla bla\n3.3 even more bla bla'
which has the following format:
1. Hello world
1.1 bla bla bla
1.2 more bla bla
1.3 even more bla bla ABC
2. ABC
2.1 hello ABC
2.2 bla bla bla
3. XYZ
3.1 bla bla
3.2 more bla bla
3.3 even more bla bla
How can I select only the ABC section, so that I get the following output:
2. ABC
2.1 hello ABC
2.2 bla bla bla
One might suggest re.findall(r'^2\..*', document1, re.MULTILINE), but note that the ABC section doesn't always have to be at point 2. For instance, I can have:
document2 = '1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC\n\n2. XYZ\n2.1 bla bla\n2.2 more bla bla\n2.3 even more bla bla\n\n\n3. MNO\n 3.1 hello MNO\n3.2 bla bla bla\n\n\n4. ABC\n4.1 hello ABC\n4.2 bla bla bla'
1. Hello world
1.1 bla bla bla
1.2 more bla bla
1.3 even more bla bla ABC
2. XYZ
2.1 bla bla
2.2 more bla bla
2.3 even more bla bla
3. MNO
3.1 hello MNO
3.2 bla bla bla
4. ABC
4.1 hello ABC
4.2 bla bla bla
where ABC is in section 4.
CodePudding user response:
You can use
^\d+\.\s*ABC[^\S\n]*(?:\n.+)*
See the regex demo. Pass only the re.M flag when compiling the regex object. Details:
^ - start of a line
\d+ - one or more digits
\. - a dot
\s* - zero or more whitespace characters
ABC - the literal string ABC
[^\S\n]* - zero or more whitespace characters other than an LF char
(?:\n.+)* - zero or more non-empty lines.
To get all matches, you can use
matches = re.findall(r'^\d+\.\s*ABC[^\S\n]*(?:\n.+)*', document1, re.M)
To get the first match only, you can use
match = re.search(r'^\d+\.\s*ABC[^\S\n]*(?:\n.+)*', document1, re.M)
if match:
    print(match.group())
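As a minimal sketch of how the pattern above could be wrapped for an arbitrary section title, the snippet below builds the regex from a parameter. The helper name find_section and its title parameter are assumptions made for illustration; they are not part of the answer above.

```python
import re

document1 = '1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC\n\n2. ABC \n2.1 hello ABC\n2.2 bla bla bla\n\n3. XYZ\n3.1 bla bla\n3.2 more bla bla\n3.3 even more bla bla'
document2 = '1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC\n\n2. XYZ\n2.1 bla bla\n2.2 more bla bla\n2.3 even more bla bla\n\n\n3. MNO\n 3.1 hello MNO\n3.2 bla bla bla\n\n\n4. ABC\n4.1 hello ABC\n4.2 bla bla bla'

def find_section(document, title):
    # Hypothetical helper: return the first block whose heading line is
    # "<digits>. <title>", followed by all directly adjacent non-empty lines.
    # re.escape keeps any regex metacharacters in the title literal.
    pattern = r'^\d+\.\s*' + re.escape(title) + r'[^\S\n]*(?:\n.+)*'
    match = re.search(pattern, document, re.M)
    return match.group() if match else None

print(find_section(document1, 'ABC'))
print(find_section(document2, 'ABC'))
```

The function returns None when no heading matches, so the caller can tell "not found" apart from an empty section.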
CodePudding user response:
I would split the text into paragraphs:
>>> document1.split("\n\n")
[
"1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC",
"2. ABC \n2.1 hello ABC\n2.2 bla bla bla",
"3. XYZ\n3.1 bla bla\n3.2 more bla bla\n3.3 even more bla bla"
]
>>> document2.split("\n\n")
[
"1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC",
"2. XYZ\n2.1 bla bla\n2.2 more bla bla\n2.3 even more bla bla",
"\n3. MNO\n 3.1 hello MNO\n3.2 bla bla bla",
"\n4. ABC\n4.1 hello ABC\n4.2 bla bla bla"
]
Then search for the paragraph that contains ". ABC":
found = next((para for para in document1.split("\n\n") if ". ABC" in para), "")
The above works with document2 as well. If you want, you can replace the test ". ABC" in para with re.search(r"\d+\. ABC", para).
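A possible variation on the paragraph approach, sketched under a few assumptions: paragraphs are split on runs of two or more newlines (so the pieces from document2 do not start with a stray "\n"), and the test is anchored to the start of each paragraph. The function name find_paragraph is hypothetical.

```python
import re

document2 = '1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC\n\n2. XYZ\n2.1 bla bla\n2.2 more bla bla\n2.3 even more bla bla\n\n\n3. MNO\n 3.1 hello MNO\n3.2 bla bla bla\n\n\n4. ABC\n4.1 hello ABC\n4.2 bla bla bla'

def find_paragraph(document, title):
    # Hypothetical helper: split on two or more consecutive newlines,
    # then return the first paragraph whose first line is "<digits>. <title>".
    heading = re.compile(r'\d+\.\s*' + re.escape(title))
    for para in re.split(r'\n{2,}', document):
        if heading.match(para):  # match() only tests the start of the paragraph
            return para
    return ''

print(find_paragraph(document2, 'ABC'))
```

Anchoring at the paragraph start means a title that merely appears later in some other paragraph's text is not picked up.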
CodePudding user response:
Here is one way to do it: first extract the section number for that heading, then apply your suggested findall approach. Note that you need to adjust the code if the section appears more than once.
import re
document1 = '1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC\n\n2. ABC \n2.1 hello ABC\n2.2 bla bla bla\n\n3. XYZ\n3.1 bla bla\n3.2 more bla bla\n3.3 even more bla bla'
document2 = '1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC\n\n2. XYZ\n2.1 bla bla\n2.2 more bla bla\n2.3 even more bla bla\n\n\n3. MNO\n 3.1 hello MNO\n3.2 bla bla bla\n\n\n4. ABC\n4.1 hello ABC\n4.2 bla bla bla'
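def get_section(document, substr):
    # Find the heading (e.g. "2. ABC") and keep the number before the dot.
    section_expr = r"\d*\. " + substr
    section_no = re.findall(section_expr, document)[0].rsplit('. ', 1)[0]
    # Collect every line fragment that starts with "<section_no>.".
    subsection_expr = str(section_no) + r'\..*'
    return re.findall(subsection_expr, document)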
print(get_section(document1, "ABC"))
print(get_section(document2, "ABC"))
Result:
['2. ABC ', '2.1 hello ABC', '2.2 bla bla bla']
['4. ABC', '4.1 hello ABC', '4.2 bla bla bla']
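One way the adjustment for repeated sections might look is sketched below. It is only an illustration: the name get_sections and the sample text are made up, and the patterns are anchored to line starts with re.M, which is stricter than the code above (an indented line such as " 3.1 hello MNO" would not be collected).

```python
import re

def get_sections(document, substr):
    # Hypothetical variant: return one list of lines per matching heading.
    heading = re.compile(r'^(\d+)\. ' + re.escape(substr), re.M)
    sections = []
    for number in heading.findall(document):
        # Every line that starts with "<number>." belongs to that section.
        lines = re.findall(r'^' + number + r'\..*', document, re.M)
        sections.append(lines)
    return sections

# Made-up sample in which the ABC heading appears twice.
sample = '1. ABC\n1.1 a\n\n2. XYZ\n2.1 b\n\n3. ABC\n3.1 c'
print(get_sections(sample, 'ABC'))
```

An empty list is returned when no heading matches, which avoids the IndexError that indexing [0] on an empty findall result would raise.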