How to keep only strings which follow a specific form in a list (Python)

I have a text corpus extracted from a PDF file, stored in the list below:

list=["7.1 PLAN COST MANAGEMENT",'Plan Cost Management is the process of defining how the project costs will be estimated','7.1.1 PLAN COST MANAGEMENT: INPUTS','Described in Section 4.2.3.1. The project charter provides the preapproved financial ','7.1.1.1 PROJECT CHARTER']

However, I want to extract only the titles in this list that follow a specific form, as in these examples [(d.d.d.d upper case title) or (d.d.d upper case title) or (d.d upper case title)], and get rid of the rest. I don't really know how to approach this properly. Any help is appreciated.

CodePudding user response：

This is a perfect use case for regular expressions. Here's some code to do what you're asking:

import re

list = ["7.1 PLAN COST MANAGEMENT",
        'Plan Cost Management is the process of defining how the project costs will be estimated',
        '7.1.1 PLAN COST MANAGEMENT: INPUTS',
        'Described in Section 4.2.3.1. The project charter provides the preapproved financial ',
        '7.1.1.1 PROJECT CHARTER']

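# A number with 1 to 3 dotted parts (d.d up to d.d.d.d), spaces, then an upper-case title
exp = re.compile(r"(\d+(\.\d+){1,3}) +([A-Z :]+)")

for x in list:
    m = exp.match(x)
    if m:
        print(m.group(0))

Result:

7.1 PLAN COST MANAGEMENT
7.1.1 PLAN COST MANAGEMENT: INPUTS
7.1.1.1 PROJECT CHARTER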

You weren't clear about what constitutes a valid "upper case title". This solution assumes that upper-case letters, the ':' character, and spaces are the valid characters in a title. You can adjust what's inside the square brackets in the expression to change which characters you consider valid in titles.
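If you would rather get the matching strings back as a new list, and reject any string that only starts like a title, a variant along the following lines may help. This is a minimal sketch, not part of the answer above: the helper name `keep_titles` and its `title_chars` parameter are made up for illustration, and it assumes a title is a whole list item consisting of a d.d, d.d.d, or d.d.d.d number, one or more spaces, and only the allowed characters.

```python
import re

corpus = ["7.1 PLAN COST MANAGEMENT",
          'Plan Cost Management is the process of defining how the project costs will be estimated',
          '7.1.1 PLAN COST MANAGEMENT: INPUTS',
          'Described in Section 4.2.3.1. The project charter provides the preapproved financial ',
          '7.1.1.1 PROJECT CHARTER']


def keep_titles(items, title_chars="A-Z :"):
    """Hypothetical helper: return only the items that are numbered upper-case titles.

    title_chars is inserted into a regex character class, so it controls
    which characters are accepted in the title part.
    """
    # \d+(?:\.\d+){1,3} accepts d.d, d.d.d and d.d.d.d (multi-digit parts allowed)
    pattern = re.compile(rf"\d+(?:\.\d+){{1,3}} +[{title_chars}]+")
    # fullmatch() requires the whole string to fit the pattern, not just a prefix
    return [s for s in items if pattern.fullmatch(s)]


titles = keep_titles(corpus)
print(titles)
```

Using `fullmatch()` instead of `match()` means a string such as "7.1 PLAN Cost" is dropped entirely rather than matched up to the first lower-case letter. To allow digits in titles, for example, you could call `keep_titles(corpus, title_chars="A-Z0-9 :")`.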

CodePudding user response：

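Try this solution; it should give the result you want. Note that the pattern ends in \n, so it expects each title to be followed by a line break:

import re
rx=re.compile(r'(\d+\.){1,3}\d+ [A-Z|\s|A-Z]+(\d+)?\n')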
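One way to use a pattern of this kind is on the raw extracted text, before it is split into a list. The sketch below is an assumption about how that might look, not the answerer's own code: `raw_text` is a made-up stand-in built from the question's sample lines, and the pattern is a simplified variant that anchors on line boundaries with `re.MULTILINE` instead of matching a literal `\n`.

```python
import re

# Stand-in for the raw PDF text (assumed: one heading or sentence per line)
raw_text = (
    "7.1 PLAN COST MANAGEMENT\n"
    "Plan Cost Management is the process of defining how the project costs will be estimated\n"
    "7.1.1 PLAN COST MANAGEMENT: INPUTS\n"
    "Described in Section 4.2.3.1. The project charter provides the preapproved financial \n"
    "7.1.1.1 PROJECT CHARTER\n"
)

# With re.MULTILINE, ^ and $ match at the start and end of every line.
# The groups are non-capturing so findall() returns the whole matched line.
heading = re.compile(r"^(?:\d+\.){1,3}\d+ +[A-Z :]+$", re.MULTILINE)

titles = heading.findall(raw_text)
for title in titles:
    print(title)
```

Because the character class contains no line break, a match can never run over into the following line, and the newline itself is not included in the returned titles.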