text:
- MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED. | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED. | 25. CONSUMER ADVISORY PROVIDED FOR RAW/UNDERCOOKED FOOD - Comments: MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED. | 38.
Question:
The text contains sections 3, 5, 25, and 38, each introduced by its index number. For each section, I want to extract all the text after '- Comments:' and before the index number of the next section.
import re

def comments(x):
    result = []
    for elem in df['Violations']:
        matches = re.findall(r'\d+\. (.*?)(?: - |\r?\n|$)', elem)
        result.extend(matches)
    print(result)
The code above does the opposite: it extracts only the words before '- Comments:'. How can I change it?
Many thanks
CodePudding user response:
If you want the text between Comments: and |, use those two markers in the regex:
'Comments: ([^\|]*) \|'
The parentheses () capture only the characters between Comments: and |, and [^\|] restricts them to characters other than |.
Because | has a special meaning in regex (alternation), it is written as \| to match it as a literal character.
Or
'Comments: (.*?) \|'
where the ? makes .* lazy, so the match stops at the first | instead of running on to the last one.
import re
elem = '''MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED. | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED. | 25. CONSUMER ADVISORY PROVIDED FOR RAW/UNDERCOOKED FOOD - Comments: MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED. | 38.'''
# Raw strings keep the backslash in \| from being read as a string escape.
#matches = re.findall(r'Comments: ([^\|]*) \|', elem)
matches = re.findall(r'Comments: (.*?) \|', elem)
#print(matches)
for item in matches:
    print(item)
    print('---')
Result:
NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED.
---
NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED.
---
MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED.
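The patterns above require a | after every comment, so a comment that is the last thing in a string would be skipped. A minimal sketch of one way to handle that, and to process several rows at once, is shown here. The function name extract_comments and the sample rows are hypothetical; in the question the rows would come from df['Violations'], and the code assumes each comment ends at " |" or at the end of the string.

```python
import re

# Assumption: a comment ends at " |" or at the end of the string.
COMMENT_RE = re.compile(r"Comments:\s*(.*?)(?: \||$)")

def extract_comments(rows):
    """Return one list of comment strings per input row."""
    return [COMMENT_RE.findall(row) for row in rows]

# Hypothetical, shortened rows standing in for df['Violations'].
rows = [
    "3. TITLE A - Comments: FIRST COMMENT. | 5. TITLE B - Comments: SECOND COMMENT.",
    "25. TITLE C - Comments: THIRD COMMENT. | 38.",
]

result = extract_comments(rows)
for comments in result:
    print(comments)
```

Any iterable of strings can be passed in, so a pandas column should work the same way as the list used here.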
CodePudding user response:
Your pattern captures as little text as possible in a group before either " - ", a newline or the end of the string, and it does not match any part with Comments:
You could change it by matching Comments: and adding a capture group for the text after it:
\d+\. .*?(?: - Comments:\s*)(.*?)(?: \||$)
A slightly more precise approach is to match the start of each text, which is digits, a dot and a space, and then match until the first occurrence of - Comments: without crossing the start of another text.
Then, after Comments:, you can use a capture group to capture until the next occurrence of a text, or assert the end of the string if it is the last one.
Using re.findall will return the value of capture group 1.
\b\d+\. (?:(?!\d+\. |- Comments:).)*- Comments:\s*(.*?)(?: \||$)
The pattern matches:
\b : a word boundary to prevent a partial word match
\d+\.  : 1+ digits, a dot and a space
(?:(?!\d+\. |- Comments:).)* : any char, as long as what is directly to the right is not the pattern \d+\. (followed by a space) or - Comments:
- Comments:\s* : - Comments: followed by optional whitespace chars
(.*?) : capture group 1, any char, as few as possible
(?: \||$) : either a space followed by |, or the end of the string
Example
import re
regex = r"\b\d+\. (?:(?!\d+\. |- Comments:).)*- Comments:\s*(.*?)(?: \||$)"
s = "3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED. | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED. | 25. CONSUMER ADVISORY PROVIDED FOR RAW/UNDERCOOKED FOOD - Comments: MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED. | 38. "
print(re.findall(regex, s))
Output
[
'NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED.',
'NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED.',
'MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED.'
]
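The difference between the simpler and the more precise pattern in this answer only shows up when a section has no - Comments: part. The sketch below uses a made-up string where section 3 has no comment, and adds an extra capture group for the section number (an addition for illustration, not part of the patterns above) so that it is visible which section each comment gets attributed to.

```python
import re

# Hypothetical sample: section 3 has no "- Comments:" part.
text = "3. TITLE A | 5. TITLE B - Comments: FOO | 38."

# Group 1 is the section number, group 2 is the comment text.
SIMPLE = r"(\d+)\. .*? - Comments:\s*(.*?)(?: \||$)"
PRECISE = r"\b(\d+)\. (?:(?!\d+\. |- Comments:).)*- Comments:\s*(.*?)(?: \||$)"

simple = re.findall(SIMPLE, text)
precise = re.findall(PRECISE, text)

print(simple)   # [('3', 'FOO')]  comment of section 5 attributed to section 3
print(precise)  # [('5', 'FOO')]  section 3 is skipped, section 5 matched
```

With two capture groups re.findall returns a list of tuples instead of a list of strings. The negative lookahead in the precise pattern stops the title part from running over the start of the next section, which is why section 3 yields no match there.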