I'm working on an academic research project that requires extracting titles from a Table of Contents. I'm making a Python program to clean up text that looks like this:
BONDS OF LATE:
An act providing the officers of the State of Illinois from making payments on certain bonds ............ 79
An act to provide for publishing a now edition of Dresses Reports ..................................... 78
BRIDGES:
An act to provide for the better protection of the public bridges in this State ........................... 74
to look like this:
An act providing the officers of the State of Illinois from making payments on certain bonds .
An act to provide for publishing a now edition of Dresses Reports .
An act to provide for the better protection of the public bridges in this State .
My strategy is to somehow iterate through a text file and delete characters after the first '.' and before the next 'An act'. I thought about trying a nested 'for' loop like this:
for line in file:
    for character in line:
But iterating by character makes it impossible to stop at a string (i.e. 'An act'). I'm a beginner to Python (and coding) and would greatly appreciate any help. Are there regular expressions that would help delete all the characters in a line before 'An act' and after the first period? Thank you!
CodePudding user response:
You can use a regular expression that matches lines starting with "An act", followed by a space and at least one character, up to and including the first period (see this regex101 for a more in-depth explanation). The non-greedy quantifier +? stops the match at the first period, and ?: marks a group that we don't need to capture separately:
import re

with open("data.txt") as file:
    for line in file:
        # Match "An act ", then as few characters as possible, then the first period.
        search_result = re.search(r"^(An act (?:.+?)\.)", line)
        if search_result:
            print(search_result.group(1))
This outputs:
An act providing the officers of the State of Illinois from making payments on certain bonds .
An act to provide for publishing a now edition of Dresses Reports .
An act to provide for the better protection of the public bridges in this State .
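If the whole table of contents is already in memory as one string, the same pattern can probably be applied in a single pass with re.MULTILINE, which makes ^ match at the start of every line. This is only a minimal sketch: the helper name extract_titles is hypothetical, and the sample text is taken from the question.

```python
import re

# Same idea as the pattern above; re.MULTILINE lets ^ match at the start of each line.
# "." does not match a newline by default, so a match never spills onto the next line.
TITLE_PATTERN = re.compile(r"^(An act .+?\.)", re.MULTILINE)


def extract_titles(text):
    """Return every 'An act ...' title, cut off at its first period."""
    return TITLE_PATTERN.findall(text)


sample = """BONDS OF LATE:
An act providing the officers of the State of Illinois from making payments on certain bonds ............ 79
BRIDGES:
An act to provide for the better protection of the public bridges in this State ........................... 74
"""

for title in extract_titles(sample):
    print(title)
```

Header lines such as "BRIDGES:" are skipped because they do not start with "An act".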
CodePudding user response:
A solution using regex and string.replace:
>>> import re
>>> lines="""
... BONDS OF LATE:
... An act providing the officers of the State of Illinois from making payments on certain bonds ............ 79
... An act to provide for publishing a now edition of Dresses Reports ..................................... 78
...
... BRIDGES:
... An act to provide for the better protection of the public bridges in this State ........................... 74
... """
>>> m = re.sub(r'\b[A-Z]+\b', '', lines)
>>> m=m.replace(":","")
>>> m=m.replace(".","")
>>> m= ''.join(i for i in m if not i.isdigit())
>>> print(m)
An act providing the officers of the State of Illinois from making payments on certain bonds
An act to provide for publishing a now edition of Dresses Reports
An act to provide for the better protection of the public bridges in this State
Adopted from here
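Both approaches above depend on the first period, or on deleting every period and digit, so a title that itself contains a period or a number (an abbreviation or a year, for example) would be cut short or altered. One possible alternative is to remove only the trailing dot leader and page number. This is a sketch under assumptions that are not in the original answers: the helper names strip_leader and clean_toc are hypothetical, every entry is assumed to end with at least two dots followed by a page number, and the second sample line is invented for illustration.

```python
import re

# Assumption: each entry ends with a run of two or more dots and a page number.
LEADER_PATTERN = re.compile(r"\s*\.{2,}\s*\d+\s*$")


def strip_leader(line):
    """Remove a trailing '..... 79' style dot leader and page number."""
    return LEADER_PATTERN.sub("", line)


def clean_toc(text):
    """Keep only 'An act' entries, with dot leaders and page numbers removed."""
    return [
        strip_leader(line)
        for line in text.splitlines()
        if line.startswith("An act")
    ]


# The first entry comes from the question; the second is a made-up line
# showing a title that contains its own period and digits.
sample = """BONDS OF LATE:
An act providing the officers of the State of Illinois from making payments on certain bonds ............ 79
An act to amend sec. 4 of the act of 1851 .......... 80
"""

for title in clean_toc(sample):
    print(title)
```

Because the pattern is anchored at the end of the line, periods and digits inside the title itself are left alone.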