I have the following text:
7.1 CAPITAL TITLE
text here
text here
7.2.2 CAPITAL TITLE
I want to get the following result:
7.1 CAPITAL TITLE text here text here
I tried to use a regex as follows:
text = re.findall(r'[7]\.\d\s[A-Z]+(?:\s+[A-Z]+)*\b(.*?)([7]\.\d\s[A-Z]+(?:\s+[A-Z]+)*\b|[7]\.\d\.\d\s[A-Z]+(?:\s+[A-Z]+)*\b|[7]\.\d\.\d\.\d\s[A-Z]+(?:\s+[A-Z]+)*\b)', txt_extraction)
print(text)
The idea is to find a 7.X heading and the next heading, which could be 7.X, 7.X.X or 7.X.X.X, and then extract only the title and the description of that title.
[7]\.\d\s[A-Z]+(?:\s+[A-Z]+)*\b
would find the 7.X titles written in capital letters.
I thought (.*?)
would at least find the text between the titles, so I could build up from there, but I got stuck.
Any help, please?
CodePudding user response:
You might use a pattern that matches the title with only capital chars, and then matches all following lines that do not start with a digit followed by a dot:
^7\.\d[^\S\n]+[A-Z]+(?:[^\S\n]+[A-Z]+)*(?:\n(?!\d\.).*)*
Explanation
^ - Start of string (or start of each line when re.M is used)
7\.\d - Match 7. and a digit (use \d+ for 1 or more digits)
[^\S\n]+[A-Z]+ - Match 1+ spaces without newlines and then 1+ uppercase chars A-Z
(?:[^\S\n]+[A-Z]+)* - Optionally repeat the previous pattern
(?: - Non capture group
\n - Match a newline
(?!\d\.).* - Match the rest of the line if it does not start with a digit followed by a dot
)* - Close the non capture group and optionally repeat it
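A minimal sketch of how this pattern might be used, assuming the text lives in a variable named txt_extraction as in the question. The re.M flag is an assumption here, so that ^ also anchors at the start of each line. Joining the pieces with a single space is one way to get the one-line result asked for.

```python
import re

txt_extraction = """7.1 CAPITAL TITLE
text here
text here
7.2.2 CAPITAL TITLE
text here
"""

# re.M lets ^ match at the start of every line, not only at the start of the string.
pattern = re.compile(
    r"^7\.\d[^\S\n]+[A-Z]+(?:[^\S\n]+[A-Z]+)*(?:\n(?!\d\.).*)*",
    re.M,
)

# Each match spans the heading plus its body lines;
# split() followed by join() flattens it onto a single line.
sections = [" ".join(match.split()) for match in pattern.findall(txt_extraction)]
print(sections)
```

Since the pattern has no capture groups, findall returns the full matches. The 7.2.2 heading is not matched because the pattern requires a space right after 7.X.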
CodePudding user response:
string = '''
7.1 CAPITAL TITLE
text here
text here
7.2.2 CAPITAL TITLE
text here
text here
'''
new_string = ''
for line in string.split('\n'):
    if line.isupper():
        # heading line: start a new output line
        new_string += '\n' + line
    else:
        # body line: append to the current output line
        new_string += ' ' + line
print(new_string)
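If a list of sections is more convenient than one long string, the same line-by-line idea could be sketched as below. The function name collect_sections and the heading regex are assumptions for illustration, not part of the answer above. Unlike isupper(), the regex only treats a line as a heading when it starts with a section number.

```python
import re

def collect_sections(text):
    """Group each numbered, all-caps heading with the lines that follow it."""
    # Hypothetical heading shape: 7.1, 7.2.2, ... followed by uppercase words only.
    heading = re.compile(r"\d+(?:\.\d+)*\s+[A-Z]+(?:\s+[A-Z]+)*$")
    sections = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if heading.match(line):
            sections.append(line)  # start a new section
        elif sections:
            sections[-1] += " " + line  # append body text to the current section
    return sections

sample = """
7.1 CAPITAL TITLE
text here
text here
7.2.2 CAPITAL TITLE
text here
text here
"""
print(collect_sections(sample))
```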