Using regex to select a specific section of a text

Suppose I have the following document:

document1 = '1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC\n\n2. ABC \n2.1 hello ABC\n2.2 bla bla bla\n\n3. XYZ\n3.1 bla bla\n3.2 more bla bla\n3.3 even more bla bla'

which has the following format:

1. Hello world
1.1 bla bla bla
1.2 more bla bla
1.3 even more bla bla ABC

2. ABC 
2.1 hello ABC
2.2 bla bla bla

3. XYZ
3.1 bla bla
3.2 more bla bla
3.3 even more bla bla

How can I select only the ABC section, so that the output is:

2. ABC 
2.1 hello ABC
2.2 bla bla bla

One might suggest re.findall(r'^2\..*', document1, re.MULTILINE), but note that the ABC section is not always section 2. For instance, I can have:

document2 = '1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC\n\n2. XYZ\n2.1 bla bla\n2.2 more bla bla\n2.3 even more bla bla\n\n\n3. MNO\n 3.1 hello MNO\n3.2 bla bla bla\n\n\n4. ABC\n4.1 hello ABC\n4.2 bla bla bla'

1. Hello world
1.1 bla bla bla
1.2 more bla bla
1.3 even more bla bla ABC

2. XYZ
2.1 bla bla
2.2 more bla bla
2.3 even more bla bla

3. MNO 
3.1 hello MNO
3.2 bla bla bla

4. ABC 
4.1 hello ABC
4.2 bla bla bla

where ABC is in section 4.

User response:

You can use

^\d+\.\s*ABC[^\S\n]*(?:\n.+)*

Pass only the re.M flag (not re.S) when compiling the regex object, so that ^ matches at the start of every line while . still stops at line breaks. Details:

^ - start of a line
\d+ - one or more digits
\. - a dot
\s* - zero or more whitespaces
ABC - ABC string
[^\S\n]* - zero or more whitespaces other than an LF char
(?:\n.+)* - zero or more non-empty lines.

To get all matches, you can use

matches = re.findall(r'^\d+\.\s*ABC[^\S\n]*(?:\n.+)*', document1, re.M)

To get the first match only, you can use

match = re.search(r'^\d+\.\s*ABC[^\S\n]*(?:\n.+)*', document1, re.M)
if match:
    print(match.group())
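
As a rough sketch, the same pattern can be wrapped in a small helper so that the section title is a parameter. The name find_section is hypothetical, and re.escape is added on the assumption that a title may contain regex metacharacters:

```python
import re

document1 = '1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC\n\n2. ABC \n2.1 hello ABC\n2.2 bla bla bla\n\n3. XYZ\n3.1 bla bla\n3.2 more bla bla\n3.3 even more bla bla'
document2 = '1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC\n\n2. XYZ\n2.1 bla bla\n2.2 more bla bla\n2.3 even more bla bla\n\n\n3. MNO\n 3.1 hello MNO\n3.2 bla bla bla\n\n\n4. ABC\n4.1 hello ABC\n4.2 bla bla bla'


def find_section(document, title):
    """Return the first section whose heading is '<digits>. <title>', or None.

    find_section is a hypothetical helper name, not part of the re module.
    """
    # Heading line, optional trailing spaces, then every following non-empty line.
    pattern = r'^\d+\.\s*' + re.escape(title) + r'[^\S\n]*(?:\n.+)*'
    match = re.search(pattern, document, re.M)
    return match.group() if match else None


print(find_section(document1, 'ABC'))
print('---')
print(find_section(document2, 'ABC'))
```

Because the match stops at the first empty line, this relies on sections being separated by at least one blank line, as they are in both sample documents.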

User response:

I would split the text into paragraphs:

>>> document1.split("\n\n")
[
  "1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC",
  "2. ABC \n2.1 hello ABC\n2.2 bla bla bla",
  "3. XYZ\n3.1 bla bla\n3.2 more bla bla\n3.3 even more bla bla"
]

>>> document2.split("\n\n")
[
  "1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC",
  "2. XYZ\n2.1 bla bla\n2.2 more bla bla\n2.3 even more bla bla",
  "\n3. MNO\n 3.1 hello MNO\n3.2 bla bla bla",
  "\n4. ABC\n4.1 hello ABC\n4.2 bla bla bla"
]

Then search for the paragraph that contains ". ABC":

found = next((para for para in document1.split("\n\n") if ". ABC" in para), "")

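The above works with document2 as well, although the paragraph found there starts with a stray "\n" that you can remove with found.strip(). If you want, you can replace the test ". ABC" in para with re.search(r"\d+\. ABC", para).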
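
A possible variation on the split approach, sketched under the assumption that sections may be separated by any number of blank lines: splitting with re.split avoids the leading "\n" seen in the document2 output, and matching at the start of each paragraph checks the heading only. The name find_paragraph is hypothetical.

```python
import re

document1 = '1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC\n\n2. ABC \n2.1 hello ABC\n2.2 bla bla bla\n\n3. XYZ\n3.1 bla bla\n3.2 more bla bla\n3.3 even more bla bla'
document2 = '1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC\n\n2. XYZ\n2.1 bla bla\n2.2 more bla bla\n2.3 even more bla bla\n\n\n3. MNO\n 3.1 hello MNO\n3.2 bla bla bla\n\n\n4. ABC\n4.1 hello ABC\n4.2 bla bla bla'


def find_paragraph(document, title):
    """Return the first paragraph whose first line is '<digits>. <title>', or ''.

    find_paragraph is a hypothetical helper name used for illustration.
    """
    # A newline, optional whitespace, then another newline marks a paragraph
    # break, so "\n\n" and "\n\n\n" are treated the same way.
    paragraphs = re.split(r'\n\s*\n', document)
    heading = re.compile(r'\d+\.\s*' + re.escape(title))
    # re.match anchors at the start of the paragraph, i.e. at its heading line.
    return next((p for p in paragraphs if heading.match(p)), "")


print(find_paragraph(document1, 'ABC'))
print('---')
print(find_paragraph(document2, 'ABC'))
```

Testing only the heading means a paragraph that merely mentions the title in one of its subsections is not selected.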

User response:

Here is one way to do it: first extract the section number from the heading, then apply the findall approach you suggested with that number. Note that you need to adjust the code if the section appears more than once.

import re

document1 = '1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC\n\n2. ABC \n2.1 hello ABC\n2.2 bla bla bla\n\n3. XYZ\n3.1 bla bla\n3.2 more bla bla\n3.3 even more bla bla'
document2 = '1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC\n\n2. XYZ\n2.1 bla bla\n2.2 more bla bla\n2.3 even more bla bla\n\n\n3. MNO\n 3.1 hello MNO\n3.2 bla bla bla\n\n\n4. ABC\n4.1 hello ABC\n4.2 bla bla bla'

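def get_section(document, substr):
    # Find the heading line, e.g. "2. ABC", and keep the number before the dot.
    section_expr = r"^\d+\. " + re.escape(substr)
    section_no = re.findall(section_expr, document, re.M)[0].rsplit('. ', 1)[0]
    # Collect every line that starts with that section number followed by a dot.
    subsection_expr = "^" + section_no + r"\..*"
    return re.findall(subsection_expr, document, re.M)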

print(get_section(document1, "ABC"))
print(get_section(document2, "ABC"))

Result:

['2. ABC ', '2.1 hello ABC', '2.2 bla bla bla']
['4. ABC', '4.1 hello ABC', '4.2 bla bla bla']
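
If the section title can appear more than once, one possible adjustment is to loop over every matching heading and collect the lines for each section number. This is only a sketch: get_sections and the sample string doc are made up for illustration, and it assumes that section numbers are unique within the document.

```python
import re


def get_sections(document, substr):
    """Return one list of lines for each section headed '<n>. <substr>'.

    get_sections is a hypothetical name; it assumes unique section numbers.
    """
    heading_expr = r"^(\d+)\. " + re.escape(substr)
    sections = []
    for heading in re.finditer(heading_expr, document, re.M):
        section_no = heading.group(1)
        # All lines that start with this section number followed by a dot.
        line_expr = "^" + section_no + r"\..*"
        sections.append(re.findall(line_expr, document, re.M))
    return sections


# Made-up sample in which the ABC title appears in sections 1 and 3.
doc = "1. ABC\n1.1 first\n\n2. XYZ\n2.1 other\n\n3. ABC\n3.1 second"
print(get_sections(doc, "ABC"))
```

When no heading matches, the function returns an empty list rather than raising an IndexError.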