Regex to extract text under titles and store it in separate arrays, grouped by title

I am completely new to regex and would appreciate it if someone could help me out here. :)

I have an input text that consists of headings, each followed by a few lines. I wish to group the headings and the corresponding content that comes under each heading into 2 separate arrays (or as 2 columns in a dataframe).

Example:

The input text:

Inclusion Criteria for all fruit lovers:

extract this line 2

extract this line 3 as well

Exclusion Criteria for all fruit lovers:

extract this exclusion line 2

extract this exclusion line 3 as well

Inclusion Criteria for apple lovers:

extract this line

extract this line as well

Exclusion Criteria for apple lovers:

extract this line

extract this line as well

the inclusion criteria for both apple and orange lovers

extract this exclusion line 2

extract this exclusion line 3 as well

the exclusion criteria for both apple and orange lovers

extract this exclusion line 2

extract this exclusion line 3 as well

Desired output: all the content under titles containing the keyword "inclusion criteria" should be grouped together under Inclusion Criteria; similarly, all the content under titles containing the keyword "exclusion criteria" should be grouped under Exclusion Criteria.

[Inclusion Criteria : extract this line 2 extract this line 3 as well ... ... .. ]

[Exclusion Criteria: extract this exclusion line 2 extract this exclusion line 3 as well ..... .... ..]

Regex I tried forming: Inclusion Criteria\s*(.*?)\s*Exclusion Criteria|Inclusion Criteria\s*(.*)(\n\n).*$

CodePudding user response：

Not the best solution, but it will do for your case (it does not use regex):

data = '''Inclusion Criteria for all fruit lovers:
extract this line 2
extract this line 3 as well
Exclusion Criteria for all fruit lovers:
extract this exclusion line 2
extract this exclusion line 3 as well
Inclusion Criteria for apple lovers:
extract this line
extract this line as well
Exclusion Criteria for apple lovers:
extract this line
extract this line as well
the inclusion criteria for both apple and orange lovers
extract this exclusion line 2
extract this exclusion line 3 as well
the exclusion criteria for both apple and orange lovers
extract this exclusion line 2
extract this exclusion line 3 as well'''
# Split the text into lines and drop the blank ones
newline_split = data.split('\n')
space_removal = [i for i in newline_split if i.strip()]
keywords = ['Inclusion Criteria', 'Exclusion Criteria', 'inclusion criteria',
            'exclusion criteria']
# Positions of the heading lines; this assumes the headings strictly
# alternate: inclusion, exclusion, inclusion, ...
get_index_inclusion_exclusion = [idx for idx, line in enumerate(space_removal)
                                 if any(j in line for j in keywords)]
start_index = get_index_inclusion_exclusion[0::2]  # inclusion index
stop_index = get_index_inclusion_exclusion[1::2]  # exclusion index
inclusion_line = []
exclusion_line = []
for i in range(len(start_index)):
    # Lines between an inclusion heading and the following exclusion heading
    text = space_removal[start_index[i] + 1:stop_index[i]]
    for j in text:
        inclusion_line.append(j)
    try:
        # Lines between an exclusion heading and the next inclusion heading
        exclusion_text = space_removal[stop_index[i] + 1:start_index[i + 1]]
    except IndexError:
        # Last block: take everything up to the end of the text
        exclusion_text = space_removal[stop_index[i] + 1:]
    for k in exclusion_text:
        exclusion_line.append(k)

print(f'Inclusion Criteria :{inclusion_line}')
print(f'Exclusion Criteria :{exclusion_line}')
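
The index-based approach above relies on the headings strictly alternating between inclusion and exclusion. As a rough alternative sketch (not part of the answer above), the lines can also be read one at a time while remembering which kind of heading was seen last. The function name `group_criteria` and the shortened sample text are made up for illustration.

```python
def group_criteria(text):
    """Collect the lines under each heading into 'inclusion' or 'exclusion'."""
    groups = {"inclusion": [], "exclusion": []}
    current = None  # group that the lines currently being read belong to
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        lowered = line.lower()
        if "inclusion criteria" in lowered:
            current = "inclusion"  # heading line: switch group, do not store it
        elif "exclusion criteria" in lowered:
            current = "exclusion"
        elif current is not None:
            groups[current].append(line)
    return groups


# Shortened, made-up sample in the same shape as the question's input
sample = """Inclusion Criteria for all fruit lovers:

extract this line 2

Exclusion Criteria for all fruit lovers:

extract this exclusion line 2

the inclusion criteria for both apple and orange lovers

extract this line 3"""

result = group_criteria(sample)
print(result["inclusion"])
print(result["exclusion"])
```

Because the current group is tracked per heading, this version should also cope with two inclusion headings in a row, which the slicing with `[0::2]` and `[1::2]` would not.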

CodePudding user response：

If you want to use a pattern, you can use 3 capture groups: capture groups 1 and 2 match either In or Ex before clusion, which determines what kind of block it is.

In capture group 3, you can match all lines that belong to that block.

^.*\b(?:([Ii]n)|([Ee]x))clusion [Cc]riteria\b.*((?:\n(?!.*\b(?:[Ii]n|[Ee]x)clusion [Cc]riteria\b).*)*)

Explanation

^ Start of a line (the pattern is used with re.MULTILINE)
.*\b Match any characters on the line, followed by a word boundary
(?: Non capture group
- ([Ii]n)|([Ee]x) Capture In in group 1, or Ex in group 2
) Close the non capture group
clusion [Cc]riteria\b Match clusion and the word Criteria
.* Match the rest of the line
( Capture group 3
- (?: Non capture group to repeat as a whole
  - \n Match a newline
  - (?!.*\b(?:[Ii]n|[Ee]x)clusion [Cc]riteria\b) Assert that the line does not contain an inclusion criteria or exclusion criteria heading
  - .* Match the whole line
- )* Close and optionally repeat the non capture group
) Close group 3
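
To see how the three groups get filled, here is a small sketch that applies the same pattern to a made-up two-block input (the sample lines are invented for illustration) and prints the groups of each match.

```python
import re

# Same pattern as above, split over two lines for readability
pattern = re.compile(
    r"^.*\b(?:([Ii]n)|([Ee]x))clusion [Cc]riteria\b.*"
    r"((?:\n(?!.*\b(?:[Ii]n|[Ee]x)clusion [Cc]riteria\b).*)*)",
    re.MULTILINE,
)

# Made-up sample text with one inclusion and one exclusion block
text = (
    "Inclusion Criteria for apple lovers:\n"
    "likes apples\n"
    "Exclusion Criteria for apple lovers:\n"
    "allergic to apples"
)

# Each match yields a tuple (group 1, group 2, group 3)
groups = [m.groups() for m in pattern.finditer(text)]
for g in groups:
    print(g)
```

For each match, exactly one of group 1 and group 2 is set (the other is None), and group 3 holds the lines of that block, including the leading newline.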

See a regex demo with the capture group values.

For example, to capture the lines in 2 different lists:

import re
import pprint
pattern = r"^.*\b(?:([Ii]n)|([Ee]x))clusion [Cc]riteria\b.*((?:\n(?!.*\b(?:[Ii]n|[Ee]x)clusion [Cc]riteria\b).*)*)"

s = ("Inclusion Criteria for all fruit lovers:\n\n"
            "extract this inclusion line\n\n"
            "extract this inclusion line as well\n\n"
            "Exclusion Criteria for all fruit lovers:\n\n"
            "extract this exclusion line 2\n\n"
            "extract this exclusion line 3 as well\n\n"
            "the inclusion criteria for both apple and orange lovers\n\n"
            "extract this exclusion line 2\n\n"
            "extract this exclusion line 3 as well\n\n"
            "the exclusion criteria for both apple and orange lovers\n\n"
            "extract this exclusion line 2\n\n"
            "extract this exclusion line 3 as well")
matches = re.finditer(pattern, s, re.MULTILINE)

inclusion_criteria = []
exclusion_criteria = []

for match in matches:
    # Group 1 is set for an inclusion heading, group 2 for an exclusion heading
    if match.group(1):
        inclusion_criteria.append(match.group(3))
    elif match.group(2):
        exclusion_criteria.append(match.group(3))

print("Inclusion Criteria")
pprint.pprint([block.strip() for block in inclusion_criteria if block])
print("Exclusion Criteria")
pprint.pprint([block.strip() for block in exclusion_criteria if block])

Output

Inclusion Criteria
['extract this inclusion line\n\nextract this inclusion line as well',
 'extract this exclusion line 2\n\nextract this exclusion line 3 as well']
Exclusion Criteria
['extract this exclusion line 2\n\nextract this exclusion line 3 as well',
 'extract this exclusion line 2\n\nextract this exclusion line 3 as well']
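
Each list entry above is one whole block, with the blank lines still inside it. If individual lines (or one joined string per category, as in the desired output of the question) are needed, the blocks can be split afterwards. A minimal sketch, assuming the inclusion list printed above; `flatten` is a hypothetical helper name.

```python
# The inclusion blocks as printed in the output above
inclusion_blocks = [
    "extract this inclusion line\n\nextract this inclusion line as well",
    "extract this exclusion line 2\n\nextract this exclusion line 3 as well",
]


def flatten(blocks):
    """Split multi-line blocks into a flat list of non-empty, stripped lines."""
    return [
        line.strip()
        for block in blocks
        for line in block.splitlines()
        if line.strip()
    ]


inclusion_lines = flatten(inclusion_blocks)
print(inclusion_lines)

# One string per category, similar to the desired output in the question
inclusion_text = " ".join(inclusion_lines)
print(f"Inclusion Criteria : {inclusion_text}")
```

The same helper can be applied to the exclusion list.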