Python Regex - Select Text between a pattern

Time:10-06

I have the sample text descriptions below.

Input Example 1: (Comments in the text)

1. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLIED AND ACCESSIBLE - Comments: AN HANDWASHING SINK INSTALLED AT EAST BAR AREA.NEED TO RELOCATE HANDWASHING SINK AT MIDDLE OF AREA BEHIND THE BAR COUNTER FOR ACCESSIBILITY TO HAND WASHING. PRIORITY FOUNDATION VIOLATION :7-38-030(C), NO CITATION ISSUED.

Example 2: (No Comments in the text)

47. FOOD & NON-FOOD CONTACT SURFACES CLEANABLE, PROPERLY DESIGNED, CONSTRUCTED & USED

The output should look like the lines below. I need to extract all text after the leading index number and the '.' that follows it, and before '- Comments:'. If no '- Comments:' is found, all text after the leading index number should be extracted.

ADEQUATE HANDWASHING SINKS PROPERLY SUPPLIED AND ACCESSIBLE

FOOD & NON-FOOD CONTACT SURFACES CLEANABLE, PROPERLY DESIGNED, CONSTRUCTED & USED

I tried the regular expression '(?:^[\d . ]*)(.*?)(?:\-\s\w*: ?)', which worked for example 1 but not for example 2.

Is it possible to match both examples in one regular expression?

CodePudding user response:

Here is one approach using re.findall. We can match on the following regex:

\d+\. (.*?)(?: - |\r?\n|$)

This matches the start of a numbered section (one or more digits, a dot, and a space), then lazily captures the text until it reaches " - ", a line break (CR?LF), or the end of the input.

import re

inp = """1. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLIED AND ACCESSIBLE - Comments: AN HANDWASHING SINK INSTALLED AT EAST BAR AREA.NEED TO RELOCATE HANDWASHING SINK AT MIDDLE OF AREA BEHIND THE BAR COUNTER FOR ACCESSIBILITY TO HAND WASHING. PRIORITY FOUNDATION VIOLATION :7-38-030(C), NO CITATION ISSUED.

Example 2: (No Comments in the text)

47. FOOD & NON-FOOD CONTACT SURFACES CLEANABLE, PROPERLY DESIGNED, CONSTRUCTED & USED"""

# Capture the text after "<digits>. " up to " - ", a line break, or the end of input.
matches = re.findall(r'\d+\. (.*?)(?: - |\r?\n|$)', inp)
print(matches)

This prints:

['ADEQUATE HANDWASHING SINKS PROPERLY SUPPLIED AND ACCESSIBLE',
 'FOOD & NON-FOOD CONTACT SURFACES CLEANABLE, PROPERLY DESIGNED, CONSTRUCTED & USED']
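
If each description arrives as its own string (for example, one value per row), a variation is to anchor on the literal '- Comments:' marker instead of on any ' - '. This is only a sketch under that assumption. The names TITLE_RE and extract_title are made up for illustration and are not part of the original answer.

```python
import re

# Assumption: one description per string, starting with an index such as "1."
# or "47.", optionally followed by a "- Comments:" section.
TITLE_RE = re.compile(r'\s*\d+\.\s*(.*?)\s*(?:-\s*Comments:.*)?$', re.DOTALL)

def extract_title(description):
    """Return the text after the leading index and before '- Comments:'.

    Returns None when the string does not start with an index number.
    """
    m = TITLE_RE.match(description)
    return m.group(1) if m else None

example_1 = ("1. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLIED AND ACCESSIBLE"
             " - Comments: AN HANDWASHING SINK INSTALLED AT EAST BAR AREA.")
example_2 = ("47. FOOD & NON-FOOD CONTACT SURFACES CLEANABLE, "
             "PROPERLY DESIGNED, CONSTRUCTED & USED")

print(extract_title(example_1))
print(extract_title(example_2))
```

Because the optional group only matches the exact word 'Comments:', a title that itself contains a spaced hyphen would stay intact, whereas the findall pattern above would stop at the first ' - '.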