I have a block of text in a .txt file that reads:
OVERVIEW OF FINDINGS
--datatext1
OVERVIEW OF FINDINGS
--datatext2
OVERVIEW OF FINDINGS
--datatext3
SUMMARY OF FINDINGS
"OVERVIEW OF FINDINGS" can occur any number of times, or only once. I am ONLY interested in datatext3 (a variable amount of text). That is, only the text that lies between the last occurrence of "OVERVIEW OF FINDINGS" and "SUMMARY OF FINDINGS".
There are a few posts on how to use re and how to split strings to get the right text. From them I was able to put together the working solution below. However, it uses multiple for loops and an if/elif append loop. It seems extremely convoluted, and I'm wondering whether I am overlooking a far simpler solution.
import re
# Index all occurrences of OVERVIEW OF FINDINGS and SUMMARY OF FINDINGS:
x = []
y = []
for i in re.finditer('OVERVIEW OF FINDINGS', data):
    x.append(i.start())
for j in re.finditer('SUMMARY OF FINDINGS', data):
    y.append(j.start())
# Append to overview only when the next overview index is after the next summary index
n = 0
overview = []
for m in range(0, len(x)):
    if x[m] == x[-1]:  # condition for last value in x or if only one value in x
        overview.append(data[x[m] + 21:y[n]])  # (Note: OVERVIEW OF FINDINGS plus its trailing newline = 21)
    elif x[m + 1] > y[n]:
        overview.append(data[x[m] + 21:y[n]])
        if y[-1] == y[n]:
            break
        else:
            n += 1
CodePudding user response:
Regular expressions aren't necessary here; just split the string on the substrings you're looking for.
start = 'OVERVIEW OF FINDINGS'
end = 'SUMMARY OF FINDINGS'
result = text.split(start)[-1].split(end)[0].strip()
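One thing to be aware of with the one-liner: if either marker is missing, it quietly returns whatever text is left instead of signalling a problem. Below is a minimal sketch of the same idea using str.rpartition and str.partition, which report whether the separator was found. The function name last_section and the multi-line sample are made up for illustration, and returning None for a missing marker is just one possible choice.

```python
def last_section(text, start="OVERVIEW OF FINDINGS", end="SUMMARY OF FINDINGS"):
    """Return the text between the last `start` marker and the next `end` marker.

    Hypothetical helper: returns None when either marker is missing.
    """
    # rpartition splits on the LAST occurrence of `start`.
    _, found_start, tail = text.rpartition(start)
    if not found_start:
        return None  # no start marker anywhere in the text
    # partition splits on the FIRST occurrence of `end` after that point.
    body, found_end, _ = tail.partition(end)
    if not found_end:
        return None  # the last start marker is never followed by an end marker
    return body.strip()


# Illustrative sample where the wanted section spans two lines.
sample = """OVERVIEW OF FINDINGS
--datatext1
OVERVIEW OF FINDINGS
--datatext2
OVERVIEW OF FINDINGS
--datatext3 line one
more of datatext3
SUMMARY OF FINDINGS"""

print(last_section(sample))
```

Like the split version, this never scans the text with a regex engine, so it should behave the same on well-formed input and only differs when a marker is absent.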
CodePudding user response:
If you're open to using regular expressions, we can use re.findall here:
import re
inp = """OVERVIEW OF FINDINGS
--datatext1
OVERVIEW OF FINDINGS
--datatext2
OVERVIEW OF FINDINGS
--datatext3
SUMMARY OF FINDINGS"""
text = re.findall(r'\bOVERVIEW OF FINDINGS\b(?!.*\bOVERVIEW OF FINDINGS\b)\s*(\S+)\s+SUMMARY OF FINDINGS', inp, flags=re.S)[0]
print(text) # --datatext3
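The regex pattern uses a negative lookahead to assert that OVERVIEW OF FINDINGS only matches if it is the last one in the entire text. The re.S flag lets . match newlines, so the lookahead scans the rest of the text rather than just the current line. Here is an explanation of the regex pattern: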
\bOVERVIEW OF FINDINGS\b          match 'OVERVIEW...'
(?!.*\bOVERVIEW OF FINDINGS\b)    assert that no more 'OVERVIEW...' occurs
\s*                               optional whitespace
(\S+)                             match content
\s+                               match whitespace
SUMMARY OF FINDINGS               match 'SUMMARY...'
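Note that \S+ captures a single run of non-whitespace characters, so the pattern stops matching as soon as datatext3 contains a space or spans several lines. Since the question says datatext3 is a variable amount of text, here is a hedged variant that swaps \S+ for a lazy (.*?) capture. The names PATTERN and extract_last and the two-line sample are invented for this sketch, and returning None on no match is an assumption about what the caller wants.

```python
import re

# Same negative lookahead as above, but the capture group is a lazy (.*?)
# so it can span spaces and newlines (re.S lets . match newlines).
PATTERN = re.compile(
    r"\bOVERVIEW OF FINDINGS\b(?!.*\bOVERVIEW OF FINDINGS\b)"
    r"\s*(.*?)\s*SUMMARY OF FINDINGS",
    flags=re.S,
)


def extract_last(text):
    """Hypothetical helper: text after the last OVERVIEW header, or None."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


# Illustrative sample where the wanted section contains spaces and a newline.
inp = """OVERVIEW OF FINDINGS
--datatext1
OVERVIEW OF FINDINGS
--datatext2
OVERVIEW OF FINDINGS
--datatext3 line one
more of datatext3
SUMMARY OF FINDINGS"""

print(extract_last(inp))
```

Using re.search with an explicit None check also avoids the IndexError that indexing findall(...)[0] raises when nothing matches.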