I have a long string and want to extract the company data from it. Each block I want starts with a line in capital letters (the company name) and ends with the License Type/Action line, including the text after the colon.
DISTRICT ROW LLC
Premises No.: 0 License Key: 0 Date Entered: 09/08/2021
Tradename: OLSEN RUN WINERY Date Received: 09/03/2021
Address: 32900 DIAMOND HILL DR, HARRISBURG 97446
Email Address: [email protected]
License Type/Action: F-COM / N/O
WILCOX PIZZA LLC
Premises No.: 0 License Key: 0 Date Entered: 09/08/2021
Tradename: FIGARO'S PIZZA Date Received: 09/02/2021
Address: 4095 NW LOGAN RD STE B, LINCOLN CITY 97367
Email Address: [email protected]
License Type/Action: O / N/O
The output should look like this:
['DISTRICT ROW LLC
Premises No.: 0 License Key: 0 Date Entered: 09/08/2021
Tradename: OLSEN RUN WINERY Date Received: 09/03/2021
Address: 32900 DIAMOND HILL DR, HARRISBURG 97446
Email Address: [email protected]
License Type/Action: F-COM / N/O'],
['WILCOX PIZZA LLC
Premises No.: 0 License Key: 0 Date Entered: 09/08/2021
Tradename: FIGARO'S PIZZA Date Received: 09/02/2021
Address: 4095 NW LOGAN RD STE B, LINCOLN CITY 97367
Email Address: [email protected]
License Type/Action: O / N/O']
What would be the regular expression for this?
CodePudding user response:
Here is an re.findall approach which seems to work:
import re
parts = re.findall(r'\b[A-Z]+(?: [A-Z]+)*.*?License Type/Action: [^\r\n]+', inp, flags=re.DOTALL)
print(parts)
This prints:
['DISTRICT ROW LLC\n\n Premises No.: 0 License Key: 0 Date Entered: 09/08/2021\n\n Tradename: OLSEN RUN WINERY Date Received: 09/03/2021\n\n Address: 32900 DIAMOND HILL DR, HARRISBURG 97446\n\n Email Address: [email protected]\n\nLicense Type/Action: F-COM / N/O',
"WILCOX PIZZA LLC\n\n Premises No.: 0 License Key: 0 Date Entered: 09/08/2021\n\n Tradename: FIGARO'S PIZZA Date Received: 09/02/2021\n\n Address: 4095 NW LOGAN RD STE B, LINCOLN CITY 97367\n\n Email Address: [email protected]\n\nLicense Type/Action: O / N/O"]
Here is an explanation of the regex pattern:
\b[A-Z]+ match the first word of the uppercase company name
(?: [A-Z]+)* zero or more further uppercase words, each preceded by a space
.*? match all content lazily, across newlines (re.DOTALL)
License Type/Action: until reaching "License Type/Action:"
[ ] single space
[^\r\n]+ match the remainder of the final line
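To make the breakdown above easier to check, here is a minimal self-contained sketch of the same pattern written with re.VERBOSE, so each piece carries its own comment. The sample string is modeled on the question's data; the email values and the name BLOCK are placeholders introduced for illustration, not part of the original post.

```python
import re

# Hypothetical sample input modeled on the question; email values are placeholders.
inp = (
    "DISTRICT ROW LLC\n"
    "Premises No.: 0 License Key: 0 Date Entered: 09/08/2021\n"
    "Tradename: OLSEN RUN WINERY Date Received: 09/03/2021\n"
    "Address: 32900 DIAMOND HILL DR, HARRISBURG 97446\n"
    "Email Address: first@example.com\n"
    "License Type/Action: F-COM / N/O\n"
    "WILCOX PIZZA LLC\n"
    "Premises No.: 0 License Key: 0 Date Entered: 09/08/2021\n"
    "Tradename: FIGARO'S PIZZA Date Received: 09/02/2021\n"
    "Address: 4095 NW LOGAN RD STE B, LINCOLN CITY 97367\n"
    "Email Address: second@example.com\n"
    "License Type/Action: O / N/O\n"
)

# In VERBOSE mode bare whitespace is ignored, so literal spaces are written as [ ].
BLOCK = re.compile(
    r"""
    \b[A-Z]+                    # first uppercase word of the company name
    (?:[ ][A-Z]+)*              # further uppercase words, each after one space
    .*?                         # everything up to the marker, across newlines
    License[ ]Type/Action:[ ]   # the literal marker and the space after the colon
    [^\r\n]+                    # the rest of that final line
    """,
    flags=re.DOTALL | re.VERBOSE,
)

parts = BLOCK.findall(inp)
for part in parts:
    lines = part.splitlines()
    # Show only the first and last line of each captured block.
    print(lines[0], "->", lines[-1])
```

Note that the lazy .*? lets a block begin at the first uppercase word found after the previous match, so this relies on the blocks following each other directly, as they do in the question's data.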