Target text between a string and convert to a list python

Time:09-21

I have a long string and want to extract the company data from it. Each block I want starts with a company name in capital letters and ends with the License Type/Action line, including the text after the colon.

DISTRICT ROW LLC

   Premises No.: 0           License Key:   0                                  Date Entered: 09/08/2021

    Tradename: OLSEN RUN WINERY                                               Date Received: 09/03/2021

        Address: 32900 DIAMOND HILL DR, HARRISBURG 97446

  Email Address: [email protected]

License Type/Action: F-COM / N/O





WILCOX PIZZA LLC

   Premises No.: 0           License Key:   0                                  Date Entered: 09/08/2021

    Tradename: FIGARO'S PIZZA                                                 Date Received: 09/02/2021

        Address: 4095 NW LOGAN RD STE B, LINCOLN CITY 97367

  Email Address: [email protected]

License Type/Action:       O / N/O

The output should look like this:

['DISTRICT ROW LLC

   Premises No.: 0           License Key:   0                                  Date Entered: 09/08/2021

    Tradename: OLSEN RUN WINERY                                               Date Received: 09/03/2021

        Address: 32900 DIAMOND HILL DR, HARRISBURG 97446

  Email Address: [email protected]

License Type/Action: F-COM / N/O'],





['WILCOX PIZZA LLC

   Premises No.: 0           License Key:   0                                  Date Entered: 09/08/2021

    Tradename: FIGARO'S PIZZA                                                 Date Received: 09/02/2021

        Address: 4095 NW LOGAN RD STE B, LINCOLN CITY 97367

  Email Address: [email protected]

License Type/Action:       O / N/O']

What would be the regular expression for this?

CodePudding user response:

Here is an re.findall approach that works on the sample input (inp holds the long string):

import re

parts = re.findall(r'\b[A-Z]+(?: [A-Z]+)*.*?License Type/Action: [^\r\n]+', inp, flags=re.DOTALL)
print(parts)

This prints:

['DISTRICT ROW LLC\n\n   Premises No.: 0           License Key:   0                                  Date Entered: 09/08/2021\n\n    Tradename: OLSEN RUN WINERY                                               Date Received: 09/03/2021\n\n        Address: 32900 DIAMOND HILL DR, HARRISBURG 97446\n\n  Email Address: [email protected]\n\nLicense Type/Action: F-COM / N/O',
 "WILCOX PIZZA LLC\n\n   Premises No.: 0           License Key:   0                                  Date Entered: 09/08/2021\n\n    Tradename: FIGARO'S PIZZA                                                 Date Received: 09/02/2021\n\n        Address: 4095 NW LOGAN RD STE B, LINCOLN CITY 97367\n\n  Email Address: [email protected]\n\nLicense Type/Action:       O / N/O"]

Here is an explanation of the regex pattern:

\b[A-Z]+              match the first word of the uppercase company name
(?: [A-Z]+)*          a space followed by another uppercase word, zero or more times
.*?                   match all content lazily, across newlines (re.DOTALL)
License Type/Action:  until reaching "License Type/Action:"
[ ]                   single space
[^\r\n]+              match the remainder of the final line
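To make the pieces above easier to check, here is a minimal self-contained sketch of the same pattern written with re.VERBOSE so that each component carries its own comment. The sample string inp is a shortened, hypothetical stand-in for the real input (the email lines are left out), and compiling the pattern first is just a choice made for readability, not something the answer requires.

```python
import re

# Shortened sample input; a hypothetical stand-in for the long string in the question.
inp = """DISTRICT ROW LLC

   Premises No.: 0           License Key:   0        Date Entered: 09/08/2021

    Tradename: OLSEN RUN WINERY                      Date Received: 09/03/2021

License Type/Action: F-COM / N/O




WILCOX PIZZA LLC

   Premises No.: 0           License Key:   0        Date Entered: 09/08/2021

    Tradename: FIGARO'S PIZZA                        Date Received: 09/02/2021

License Type/Action:       O / N/O
"""

# In VERBOSE mode unescaped whitespace is ignored, so literal spaces are written as [ ].
pattern = re.compile(
    r"""
    \b[A-Z]+                     # first word of the uppercase company name
    (?:[ ][A-Z]+)*               # a space plus another uppercase word, zero or more times
    .*?                          # everything in between, lazily (DOTALL lets . cross newlines)
    License[ ]Type/Action:[ ]    # the closing label and one space
    [^\r\n]+                     # the remainder of that final line
    """,
    re.DOTALL | re.VERBOSE,
)

# No capturing groups are used, so findall returns the full text of each match.
parts = pattern.findall(inp)

for block in parts:
    print(repr(block))
```

Because the pattern begins at the first run of capital letters it finds, any uppercase text that appears before the first company name in the real string would be pulled into the first block; that is worth checking against the full input.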