python regex - select words after pattern-CodePudding

text:

MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED. | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED. | 25. CONSUMER ADVISORY PROVIDED FOR RAW/UNDERCOOKED FOOD - Comments: MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED. | 38.

Question:

The sections of the text include sections 3, 5, 25, and 38 (followed by starting index). I want to extract all texts from one section after '- Comments:' and before the starting index of the next section.

def comments(x):
    result = []
    for elem in df['Violations']:
        matches = re.findall(r'\d \. (.*?)(?: - |\r?\n|$)', elem)
        result.extend(matches)
    print(result)

The attached code is doing the totally opposite extraction which only extracts the words before '- Comments:', how can I change it?

Many thanks

CodePudding user response：

If you want text between Comments: and | then use these values in regex.

'Comments: ([^\|]*) \|'

It uses () to catch only chars between Comments: and | but different then | (see [^\|]).

Because | has special meaning in regex so I use \| to use it as normal char in text.

'Comments: (.*?) \|'

which uses ? to get chars different then |

import re

elem = '''MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED. | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED. | 25. CONSUMER ADVISORY PROVIDED FOR RAW/UNDERCOOKED FOOD - Comments: MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED. | 38.'''

#matches = re.findall('Comments: ([^\|]*) \|', elem)
matches = re.findall('Comments: (.*?) \|', elem)

#print(matches)

for item in matches:
    print(item)
    print('---')

Result:

NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED.
---
NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED.
---
MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED.

CodePudding user response：

Your pattern captures as least as possible text in a group before either - , a newline or end of string and does not match any part with Comments:

You could change it by matching comments, and add a capture group for the text after it

\d \. .*?(?: - Comments:\s*)(.*?)(?: \||$)

Regex demo

A bit more precise match could be matching the start of each text, which is digits, a dot and a space, and then match until the first occurrence of -Comments: without crossing the start of another text.

That after Comments, you can use a capture group to capture until the next occurrence of a text, or assert the end of the string if it is the last one.

Using re.findall will return the value of capture group 1.

\b\d \. (?:(?!\d \. |- Comments:).)*- Comments:\s*(.*?)(?: \||$)

The pattern matches:

\b A word boundary to prevent a partial word match
\d \. Match 1 digits, a dot and space
(?:(?!\d \. |- Comments:).)* Match any char if directly to the right there is not pattern \d \. or - Comments
- Comments:\s* Match - Comments: followed by optional whitespace chars
(.*?) Capture group 1, match any char as least as possible
(?: \||$) Match either

Regex demo | Python demo

Example

import re

regex = r"\b\d \. (?:(?!\d \. |- Comments:).)*- Comments:\s*(.*?)(?: \||$)"

s = "3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED. | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED. | 25. CONSUMER ADVISORY PROVIDED FOR RAW/UNDERCOOKED FOOD - Comments: MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED.  | 38. "

print(re.findall(regex, s))

Output

[
'NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED.', 
'NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED.', 
'MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED. '
]