Cleaning up a Table of Contents to extract just the Titles using Python?

Time:06-19

I'm working on an academic research project that requires extracting titles from a Table of Contents. I'm making a Python program to clean up text that looks like this:

BONDS OF LATE:
An act providing the officers of the State of Illinois from making payments on certain bonds ............ 79
An act to provide for publishing a now edition of Dresses Reports ..................................... 78

BRIDGES:
An act to provide for the better protection of the public bridges in this State ........................... 74

to look like this:

An act providing the officers of the State of Illinois from making payments on certain bonds .

An act to provide for publishing a now edition of Dresses Reports .

An act to provide for the better protection of the public bridges in this State .

My strategy is to somehow iterate through a text file and delete characters after the first '.' and before the next 'An act'. I thought about trying a nested 'for' loop like this:

for line in file:
    for character in line:

But iterating by character makes it impossible to stop at a string (i.e. 'An act'). I'm a beginner to Python (and coding) and would greatly appreciate any help. Are there regular expressions that would help delete all the characters in a line before 'An act' and after the first period? Thank you!

CodePudding user response:

You can use a regular expression that matches lines starting with "An act", followed by a space and at least one character, followed by a period. The non-greedy quantifier stops the match at the first period, and ?: marks a group that does not need to be captured:

import re

with open("data.txt") as file:
    for line in file:
        search_result = re.search(r"^(An act (?:.+?)\.)", line)
        if search_result:
            print(search_result.group(1))

This outputs:

An act providing the officers of the State of Illinois from making payments on certain bonds .
An act to provide for publishing a now edition of Dresses Reports .
An act to provide for the better protection of the public bridges in this State .
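
To collect the titles instead of printing them, a pattern of the kind described above ("An act", at least one character matched non-greedily, then a period) can be run over the whole text at once with re.MULTILINE. This is only a minimal sketch: the name extract_titles and the sample string are illustrative assumptions, not part of the original answer.

```python
import re

# "An act", at least one character (non-greedy), then the first period.
# re.MULTILINE makes ^ match at the start of every line, not just the text.
TITLE_PATTERN = re.compile(r"^(An act .+?\.)", re.MULTILINE)


def extract_titles(text):
    """Return every 'An act ...' title, cut at its first period."""
    return TITLE_PATTERN.findall(text)


# Sample text copied from the question (hypothetical variable name).
sample = """BONDS OF LATE:
An act providing the officers of the State of Illinois from making payments on certain bonds ............ 79

BRIDGES:
An act to provide for the better protection of the public bridges in this State ........................... 74
"""

for title in extract_titles(sample):
    print(title)
```

Because "." does not match a newline by default, each match stays within a single line, and heading lines such as "BRIDGES:" are skipped since they do not start with "An act".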

CodePudding user response:

A solution using a regex and str.replace:

>>> import re 
>>> lines="""
... BONDS OF LATE:
... An act providing the officers of the State of Illinois from making payments on certain bonds ............ 79
... An act to provide for publishing a now edition of Dresses Reports  ..................................... 78
... 
... BRIDGES:
... An act to provide for the better protection of the public bridges in this State ........................... 74
... """

>>> m = re.sub(r'\b[A-Z]+\b', '', lines)
>>> m = m.replace(":", "")
>>> m = m.replace(".", "")
>>> m= ''.join(i for i in m if not i.isdigit())

>>> print(m)

An act providing the officers of the State of Illinois from making payments on certain bonds  
An act to provide for publishing a now edition of Dresses Reports  

An act to provide for the better protection of the public bridges in this State   

Adapted from here
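
If some entries do not begin with "An act", another option is to strip the dot leader and page number from the end of each line and skip the section headings. The following is a rough sketch under two assumptions that the sample data satisfies but a real Table of Contents may not: headings are the lines ending in a colon, and each entry ends with a run of two or more dots followed by a page number. The name clean_toc is a hypothetical helper, not something from the answers above.

```python
import re

# Trailing dot leader (two or more dots) followed by the page number.
LEADER_PATTERN = re.compile(r"\s*\.{2,}\s*\d+\s*$")


def clean_toc(text):
    """Return entry titles with headings, dot leaders and page numbers removed."""
    titles = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headings such as "BRIDGES:".
        if not line or line.endswith(":"):
            continue
        titles.append(LEADER_PATTERN.sub("", line))
    return titles


# Sample text copied from the question (hypothetical variable name).
toc = """BONDS OF LATE:
An act to provide for publishing a now edition of Dresses Reports  ..................................... 78

BRIDGES:
An act to provide for the better protection of the public bridges in this State ........................... 74
"""

for title in clean_toc(toc):
    print(title)
```

Unlike removing every period and digit, this only touches the end of each line, so a title that contains a number or an abbreviation with a period is left intact.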
