I have a corpus of text extracted from a PDF file, stored in the list below:
list=["7.1 PLAN COST MANAGEMENT",'Plan Cost Management is the process of defining how the project costs will be estimated','7.1.1 PLAN COST MANAGEMENT: INPUTS','Described in Section 4.2.3.1. The project charter provides the preapproved financial ','7.1.1.1 PROJECT CHARTER']
However, I want to extract only the titles in this list, which have a specific form, as in this example: [(d.d.d.d UPPER CASE TITLE) or (d.d.d UPPER CASE TITLE) or (d.d UPPER CASE TITLE)],
and discard the rest. I don't really know how to approach this properly.
Any help is appreciated
CodePudding user response:
This is a perfect use case for regular expressions. Here's some code to do what you're asking:
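import re

# Named `lines` rather than `list` to avoid shadowing the built-in
lines = ["7.1 PLAN COST MANAGEMENT",
         'Plan Cost Management is the process of defining how the project costs will be estimated',
         '7.1.1 PLAN COST MANAGEMENT: INPUTS',
         'Described in Section 4.2.3.1. The project charter provides the preapproved financial ',
         '7.1.1.1 PROJECT CHARTER']

# A section number with 2 to 4 dot-separated parts (d.d, d.d.d, d.d.d.d),
# a space, then a title made of upper-case letters, spaces and colons
exp = re.compile(r"(\d+(\.\d+){1,3}) ([A-Z :]+)")

for x in lines:
    m = exp.match(x)
    if m:
        print(m.group(0))

Result:
7.1 PLAN COST MANAGEMENT
7.1.1 PLAN COST MANAGEMENT: INPUTS
7.1.1.1 PROJECT CHARTER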
You weren't clear about what constitutes a valid "upper case title". This solution assumes that the ':' character and whitespace are valid characters in a title. You can adjust what's inside the square brackets in the expression to change which characters count as valid in a title.
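As a rough sketch of that kind of adjustment, the variant below widens the character class and uses fullmatch() so that a line is kept only if the whole line is a section number plus an upper-case title. The helper name extract_titles and the extra characters allowed in titles (digits, '&', '/', '-') are assumptions for illustration, not something stated in the question.

```python
import re

# Assumed set of title characters: upper-case letters, digits, space, ':', '&', '/', '-'.
# The hyphen is placed last in the class so it is treated as a literal character.
TITLE_RX = re.compile(r"\d+(?:\.\d+){1,3} [A-Z0-9 :&/-]+")


def extract_titles(lines):
    """Return the stripped lines that consist entirely of a section number and a title."""
    stripped = (line.strip() for line in lines)
    # fullmatch() rejects lines that merely start like a title, e.g. "7.2 Estimate costs"
    return [s for s in stripped if TITLE_RX.fullmatch(s)]


corpus = ["7.1 PLAN COST MANAGEMENT",
          "Plan Cost Management is the process of defining how the project costs will be estimated",
          "7.1.1 PLAN COST MANAGEMENT: INPUTS",
          "Described in Section 4.2.3.1. The project charter provides the preapproved financial ",
          "7.1.1.1 PROJECT CHARTER"]

for title in extract_titles(corpus):
    print(title)
```

The difference from match() is that match() only anchors at the start of the string, so a mixed-case line that begins with a number and a single capital letter would still produce a (short) match.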
CodePudding user response:
Try this solution; it should give the result you want:
rx = re.compile(r'(\d+\.){1,3}\d+[A-Z|\s|A-Z]+(\d+)?\n')
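A pattern that relies on line breaks like this one is meant to run over the whole text rather than over each list item. The sketch below shows one way that might look, assuming the items are first joined into a single newline-separated string. It uses a simplified variant of the expression (line anchors instead of a trailing newline, and ':' added to the allowed characters), so it is an illustration and not the exact pattern above.

```python
import re

corpus = ["7.1 PLAN COST MANAGEMENT",
          "Plan Cost Management is the process of defining how the project costs will be estimated",
          "7.1.1 PLAN COST MANAGEMENT: INPUTS",
          "Described in Section 4.2.3.1. The project charter provides the preapproved financial ",
          "7.1.1.1 PROJECT CHARTER"]

# Assumption: the extracted text can be treated as one string with one item per line
text = "\n".join(corpus)

# With re.MULTILINE, ^ and $ match at the start and end of every line.
# The class uses a literal space instead of \s so a match cannot run across a newline.
title_rx = re.compile(r"^(?:\d+\.){1,3}\d+ [A-Z :]+$", re.MULTILINE)

# The pattern has no capturing groups, so findall() returns the full matched lines
titles = title_rx.findall(text)
print(titles)
```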