How to extract text between two regex matches?

Time:10-16

I have the following text:

7.1 CAPITAL TITLE

text here
text here

7.2.2 CAPITAL TITLE

I want to get the following result:

7.1 CAPITAL TITLE text here text here

I tried to use a regex as follows:

text = re.findall(r'[7]\.\d\s[A-Z]+(?:\s+[A-Z]+)*\b(.*?)([7]\.\d\s[A-Z]+(?:\s+[A-Z]+)*\b|[7]\.\d\.\d\s[A-Z]+(?:\s+[A-Z]+)*\b|[7]\.\d\.\d\.\d\s[A-Z]+(?:\s+[A-Z]+)*\b)', txt_extraction)
print(text)

The idea is to find the 7.X title and the next title, which could be 7.X, 7.X.X or 7.X.X.X, and only then extract the title and the description of that title.
[7]\.\d\s[A-Z]+(?:\s+[A-Z]+)*\b would find the 7.X titles written in capital letters.
I thought (.*?) would at least find the text in between the titles, so I could build up from there, but I got stuck.
Any help, please?

CodePudding user response:

You might use a pattern that matches the title with only capital chars, and then matches all following lines that do not start with a digit followed by a dot:

^7\.\d[^\S\n]+[A-Z]+(?:[^\S\n]+[A-Z]+)*(?:\n(?!\d\.).*)*

Explanation

  • ^ Start of string (or start of each line when the multiline flag is enabled)
  • 7\.\d Match 7 . and a digit (use \d+ for 1 or more digits)
  • [^\S\n]+[A-Z]+ Match 1+ whitespace chars without newlines and then 1+ uppercase chars A-Z
  • (?:[^\S\n]+[A-Z]+)* Optionally repeat the previous pattern
  • (?: Non capture group
    • \n Match a newline
    • (?!\d\.).* Match the rest of the line if it does not start with a digit followed by a dot
  • )* Close the non capture group and optionally repeat it
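
A minimal sketch of how the pattern might be applied in Python. It assumes the text is held in txt_extraction (the variable name from the question), that the + quantifiers are present in the pattern, and that re.MULTILINE is set so ^ can anchor at the start of each line. The name sections and the whitespace-collapsing step are illustrative choices, not part of the pattern itself.

```python
import re

# Sample input copied from the question.
txt_extraction = """7.1 CAPITAL TITLE

text here
text here

7.2.2 CAPITAL TITLE
"""

# re.MULTILINE lets ^ match at the start of every line,
# not only at the start of the whole string.
pattern = re.compile(
    r"^7\.\d[^\S\n]+[A-Z]+(?:[^\S\n]+[A-Z]+)*(?:\n(?!\d\.).*)*",
    re.MULTILINE,
)

# Each match spans a title plus the lines that follow it; split() and join()
# collapse the newlines and repeated spaces into single spaces.
sections = [" ".join(match.split()) for match in pattern.findall(txt_extraction)]
print(sections)  # ['7.1 CAPITAL TITLE text here text here']
```

Because the pattern has no capturing groups, re.findall returns the full text of each match. The 7.2.2 line is not matched as a title here, since the pattern expects whitespace directly after 7.X.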


CodePudding user response:

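string = '''
7.1 CAPITAL TITLE

text here
text here

7.2.2 CAPITAL TITLE

text here
text here


'''
new_string = ''
for line in string.split('\n'):
    if line.isupper():
        # A title line (all cased characters are uppercase) starts a new output line
        new_string += '\n' + line
    else:
        # Any other line is appended to the current title's line
        new_string += ' ' + line

print(new_string)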