I have the sample text descriptions below.
Input example 1 (comments in the text):
1. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLIED AND ACCESSIBLE - Comments: AN HANDWASHING SINK INSTALLED AT EAST BAR AREA.NEED TO RELOCATE HANDWASHING SINK AT MIDDLE OF AREA BEHIND THE BAR COUNTER FOR ACCESSIBILITY TO HAND WASHING. PRIORITY FOUNDATION VIOLATION :7-38-030(C), NO CITATION ISSUED.
Example 2: (No Comments in the text)
47. FOOD & NON-FOOD CONTACT SURFACES CLEANABLE, PROPERLY DESIGNED, CONSTRUCTED & USED
The output should look like the lines below. I need to extract all the text after the leading index number and the '.' that follows it, and before '- Comments:'. If '- Comments:' is not present, extract all the text after the leading index number.
ADEQUATE HANDWASHING SINKS PROPERLY SUPPLIED AND ACCESSIBLE
FOOD & NON-FOOD CONTACT SURFACES CLEANABLE, PROPERLY DESIGNED, CONSTRUCTED & USED
I tried the regular expression '(?:^[\d . ]*)(.*?)(?:\-\s\w*: ?)',
which works for example 1 but not for example 2.
Is it possible to match both examples in one regular expression?
CodePudding user response:
Here is one approach using re.findall. We can match on the following regex:
\d+\. (.*?)(?: - |\r?\n|$)
This matches from the start of a numbered section (one or more digits, a dot, and a space) and lazily captures text until it reaches either " - ", a CR?LF line ending, or the end of the input.
import re

inp = """1. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLIED AND ACCESSIBLE - Comments: AN HANDWASHING SINK INSTALLED AT EAST BAR AREA.NEED TO RELOCATE HANDWASHING SINK AT MIDDLE OF AREA BEHIND THE BAR COUNTER FOR ACCESSIBILITY TO HAND WASHING. PRIORITY FOUNDATION VIOLATION :7-38-030(C), NO CITATION ISSUED.
Example 2: (No Comments in the text)
47. FOOD & NON-FOOD CONTACT SURFACES CLEANABLE, PROPERLY DESIGNED, CONSTRUCTED & USED"""
# \d+\. matches the index number and dot; the lazy group stops at " - ", a newline, or the end of input
matches = re.findall(r'\d+\. (.*?)(?: - |\r?\n|$)', inp)
print(matches)
This prints:
['ADEQUATE HANDWASHING SINKS PROPERLY SUPPLIED AND ACCESSIBLE',
'FOOD & NON-FOOD CONTACT SURFACES CLEANABLE, PROPERLY DESIGNED, CONSTRUCTED & USED']
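One caveat with stopping at any " - ": a title that itself contains " - " would be cut short. A possible variant, sketched below, anchors each match to the start of a line and only treats the literal "- Comments:" as the cut-off, which is closer to the rule stated in the question. The helper name extract_titles, the pattern name TITLE_RE, and the third sample line are made up for illustration; this assumes each description sits on its own line.

```python
import re

# Assumed variant, not the pattern from the answer above:
# - ^ with re.MULTILINE anchors at the start of each line
# - \d+\. consumes the index number and the dot
# - (.*?) lazily captures the title
# - the optional tail swallows "- Comments:" and the rest of the line
TITLE_RE = re.compile(
    r'^[ \t]*\d+\.[ \t]*(.*?)[ \t]*(?:-[ \t]*Comments:.*)?$',
    re.MULTILINE,
)

def extract_titles(text):
    """Return the title of every numbered line in text (hypothetical helper)."""
    return [m.group(1) for m in TITLE_RE.finditer(text)]

sample = """1. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLIED AND ACCESSIBLE - Comments: AN HANDWASHING SINK INSTALLED AT EAST BAR AREA.
Example 2: (No Comments in the text)
47. FOOD & NON-FOOD CONTACT SURFACES CLEANABLE, PROPERLY DESIGNED, CONSTRUCTED & USED
3. WALK - IN COOLER CLEAN - Comments: MADE-UP LINE FOR ILLUSTRATION"""

for title in extract_titles(sample):
    print(title)
```

Lines that do not start with an index number, such as the "Example 2:" label, are skipped, and the made-up third line keeps its full title "WALK - IN COOLER CLEAN" instead of being cut at the first " - ".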