Access regex capturing groups in Python

Time:10-04

ptx captures most of what I want. Because I am not good at combining many things into one regex, I created a second regex, ptx1, that should ADDITIONALLY capture the following character sequences: One Department, One foreign Department, Two office.

    text_list = ' '.join(map(str, text))
    ptx = re.compile(r'(\s+something(?:\s+|\\n)*patternx:)(.*)(One\s+foreign)', flags = re.DOTALL | re.MULTILINE)
    ptx1 = re.compile(r'(\s+something(?:\s+|\\n)*patternx:)(.*)((One|Two)\s+(?:foreign\s+)*Department|office)', flags = re.DOTALL | re.MULTILINE)
    ten = ptx.search(text_list)
    eleven = ptx1.search(text_list)
    try:
        if ten:
            ten = ten.group(2)
        else:
            ten = None
    except:
        pass

Here is what I added before the else above. It didn't work:

        elif:
            ten = eleven.group(2)

My question is: how do I need to call the group in the elif branch in order to get the (.*) content, i.e. text_i_want, returned? My gut feeling is that I need to access eleven as if it were a list, because it has so many capturing groups, using eleven[0].group(1) to get the first element from the list and then its second group. But that didn't work either.

You can think of text_list like this:

text_list = ['...something\npatternx: text_i_want One Department',
'...something patternx: text_i_want One foreign Department',
'...something\n patternx: text_i_want Two office']

CodePudding user response:

It looks as if you got tricked when factoring in the alternatives on the right-hand side of the pattern.
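A minimal sketch of what goes wrong with the factoring, assuming the `+` quantifiers (`\s+`) that the question's patterns appear to rely on. The sample string and the names `loose` and `tight` are made up for illustration. Because `|` has the lowest precedence inside a group, the tail from the question reads as "`(One|Two)` followed by `Department`" OR just "`office`", so "Two office" is never matched as a unit:

```python
import re

# Hypothetical sample line, modelled on the data shown in the question.
sample = "something patternx: text_i_want Two office"

# Tail as factored in the question: the top-level | splits the group into
#   (One|Two)\s+(?:foreign\s+)*Department   OR   office
loose = re.compile(r"((One|Two)\s+(?:foreign\s+)*Department|office)")

# Tail with every alternative spelled out inside one non-capturing group.
tight = re.compile(r"\b(?:One\s+(?:Department|foreign(?:\s+Department)?)|Two\s+office)\b")

loose_match = loose.search(sample)
tight_match = tight.search(sample)

print(loose_match.group(0))  # office
print(loose_match.group(2))  # None, the (One|Two) group did not take part
print(tight_match.group(0))  # Two office
```

With the loose tail, the greedy `(.*)` in front of it would therefore swallow "Two" as part of the wanted text.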

You need to use

    \bsomething\s+patternx:(.*?)\b(?:One\s+foreign|One\s+Department|One\s+foreign\s+Department|Two\s+office)\b

which can be shortened as

    \bsomething\s+patternx:(.*?)\b(?:One\s+(?:Department|foreign(?:\s+Department)?)|Two\s+office)\b

Details:

  • \bsomething\s+patternx: - the whole word something, one or more whitespace characters, then the string patternx:
  • (.*?) - Group 1: any zero or more characters, as few as possible
  • \b(?:One\s+(?:Department|foreign(?:\s+Department)?)|Two\s+office)\b - either One Department, One foreign, One foreign Department, or Two office as whole words.
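To illustrate why Group 1 is lazy (`.*?`) rather than greedy (`.*`), here is a small sketch. The `joined` string is a made-up two-record sample, and `tail`, `lazy` and `greedy` are illustrative names. Since the records are joined into one string and `re.DOTALL` lets `.` cross newlines, a greedy group runs on to the last possible terminator:

```python
import re

# Two hypothetical records joined into one string, as the question does with ' '.join(...).
joined = "something patternx: first One Department something patternx: second Two office"

# Shared right-hand side: the terminating phrases as whole words.
tail = r"\b(?:One\s+(?:Department|foreign(?:\s+Department)?)|Two\s+office)\b"

lazy = re.compile(r"\bsomething\s+patternx:(.*?)" + tail, re.DOTALL)
greedy = re.compile(r"\bsomething\s+patternx:(.*)" + tail, re.DOTALL)

# findall returns the contents of Group 1 for every match.
lazy_groups = lazy.findall(joined)
greedy_groups = greedy.findall(joined)

print(lazy_groups)    # [' first ', ' second ']
print(greedy_groups)  # [' first One Department something patternx: second ']
```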

See the Python demo:

    import re

    text_list = [
        ' something\npatternx: text_i_want One Department',
        ' something patternx: text_i_want One foreign Department',
        ' something\n patternx: text_i_want Two office',
    ]
    text_list = ' '.join(map(str, text_list))
    rx = r'\bsomething\s+patternx:(.*?)\b(?:One\s+(?:Department|foreign(?:\s+Department)?)|Two\s+office)\b'
    print(re.findall(rx, text_list, re.DOTALL))
    # => [' text_i_want ', ' text_i_want ', ' text_i_want ']
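As for the elif attempt in the question: `search` returns a single match object (or None), not a list, so the group is read with `.group(n)` directly on it, and `elif` needs a condition of its own. If two separate patterns are still wanted, a fallback could look roughly like the sketch below. The patterns `primary` and `fallback` and the helper `extract` are hypothetical stand-ins, not the question's exact code:

```python
import re

# Hypothetical stand-ins for the question's two patterns; the wanted text is Group 1 in both.
primary = re.compile(r"\bsomething\s+patternx:(.*?)\bOne\s+foreign\b", re.DOTALL)
fallback = re.compile(
    r"\bsomething\s+patternx:(.*?)\b(?:One\s+Department|Two\s+office)\b", re.DOTALL
)

def extract(text):
    """Return Group 1 from the first pattern that matches, or None if neither does."""
    # search() gives a match object or None, so `or` falls through to the second pattern.
    match = primary.search(text) or fallback.search(text)
    return match.group(1) if match else None

print(repr(extract("something patternx: text_i_want One foreign Department")))
print(repr(extract("something\n patternx: text_i_want Two office")))
print(repr(extract("no match here")))
```

This avoids the bare try/except as well, since a missing match is handled explicitly.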