How to keep only strings which follow a specific form in a list (Python)

Time:10-18

I have a text corpus extracted from a PDF file, stored in the list below:

list=["7.1 PLAN COST MANAGEMENT",'Plan Cost Management is the process of defining how the project costs will be estimated','7.1.1 PLAN COST MANAGEMENT: INPUTS','Described in Section 4.2.3.1. The project charter provides the preapproved financial ','7.1.1.1 PROJECT CHARTER']

However, I want to extract only the titles in this list that follow a specific form, as shown in the example [(d.d.d.d upper case title) or (d.d.d upper case title) or (d.d upper case title)], and get rid of the rest. I don't really know how to approach this properly. Any help is appreciated.

CodePudding user response:

This is a perfect use case for regular expressions. Here's some code to do what you're asking:

import re

list = ["7.1 PLAN COST MANAGEMENT",
        'Plan Cost Management is the process of defining how the project costs will be estimated',
        '7.1.1 PLAN COST MANAGEMENT: INPUTS',
        'Described in Section 4.2.3.1. The project charter provides the preapproved financial ',
        '7.1.1.1 PROJECT CHARTER']

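# One number followed by 1 to 3 ".number" groups (d.d, d.d.d or d.d.d.d),
# then one or more spaces, then an upper-case title.
exp = re.compile(r"(\d+(\.\d+){1,3}) +([A-Z :]+)")

for x in list:
    m = exp.match(x)  # match() anchors at the start of the string
    if m:
        print(m.group(0))

Result:

7.1 PLAN COST MANAGEMENT
7.1.1 PLAN COST MANAGEMENT: INPUTS
7.1.1.1 PROJECT CHARTER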

You weren't clear about what constitutes a valid "upper case title". This solution assumes that the ':' character and spaces are valid characters in a title. You can adjust what's inside the square brackets in the expression to change which characters count as valid in a title.
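
As a rough sketch of how that adjustment might look, the character class can be passed in as a parameter. The function name keep_titles, the parameter title_chars and the variable corpus are made up for this illustration and are not from the question. It also uses fullmatch() instead of match(), on the assumption that a string should be kept only when the whole string fits the form:

```python
import re

def keep_titles(lines, title_chars="A-Z :"):
    """Return only the strings that look like 'd.d[.d[.d]] UPPER CASE TITLE'.

    title_chars is inserted into a regex character class as-is, so it must
    already be valid inside square brackets (hypothetical helper).
    """
    # \d+ then 1 to 3 groups of ".\d+" covers d.d, d.d.d and d.d.d.d
    pattern = re.compile(rf"\d+(\.\d+){{1,3}} +[{title_chars}]+")
    # fullmatch() requires the entire string to fit the pattern
    return [s for s in lines if pattern.fullmatch(s)]

corpus = [
    "7.1 PLAN COST MANAGEMENT",
    "Plan Cost Management is the process of defining how the project costs will be estimated",
    "7.1.1 PLAN COST MANAGEMENT: INPUTS",
    "Described in Section 4.2.3.1. The project charter provides the preapproved financial ",
    "7.1.1.1 PROJECT CHARTER",
]

titles = keep_titles(corpus)
print(titles)
```

To allow other characters, extend the class, for example keep_titles(corpus, r"A-Z :\-") to accept hyphens as well.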

CodePudding user response:

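Try this pattern. Because it ends with \n, it is meant to be applied to the raw extracted text (where each title sits on its own line) rather than to the individual list items: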

rx = re.compile(r'(\d+\.){1,3}\d+ [A-Z\s]+(\d+)?\n')
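
For searching the raw text directly, a line-anchored variant is one possible approach. This is only a sketch under the assumption that every title occupies a full line of the extracted text; raw_text and title_rx are invented for the example, and the pattern is a different one from the answer's, using re.MULTILINE so that ^ and $ match at line boundaries:

```python
import re

# Hypothetical raw text, assuming one title or sentence per line
raw_text = (
    "7.1 PLAN COST MANAGEMENT\n"
    "Plan Cost Management is the process of defining how the project costs will be estimated\n"
    "7.1.1 PLAN COST MANAGEMENT: INPUTS\n"
    "Described in Section 4.2.3.1. The project charter provides the preapproved financial \n"
    "7.1.1.1 PROJECT CHARTER\n"
)

# (?:...) is non-capturing, so findall() returns the whole match
# rather than just the last ".d" group.
title_rx = re.compile(r"^\d+(?:\.\d+){1,3} +[A-Z :]+$", re.MULTILINE)

found = title_rx.findall(raw_text)
print(found)
```

Anchoring with ^ keeps section numbers that appear in the middle of a sentence, such as "Section 4.2.3.1", from being picked up.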