Finding substring between string A and string B but ignoring certain string A's

Time:12-31

I have a block of text in a .txt file that reads:

OVERVIEW OF FINDINGS

--datatext1

OVERVIEW OF FINDINGS

--datatext2

OVERVIEW OF FINDINGS

--datatext3

SUMMARY OF FINDINGS

OVERVIEW OF FINDINGS can occur any number of times, or only once. I am ONLY interested in datatext3 (a variable amount of text), that is, the text that lies between the last occurrence of "OVERVIEW OF FINDINGS" and "SUMMARY OF FINDINGS".

There are a few posts on how to use re and how to split strings to get the right text. Based on them, I put together the working solution below. However, it uses multiple for loops and an if/elif append loop. It seems extremely convoluted, and I'm wondering if I am overlooking a far simpler solution?

    import re

    # Index all occurrences of OVERVIEW OF FINDINGS and SUMMARY OF FINDINGS:
    x = []
    y = []
    for i in re.finditer('OVERVIEW OF FINDINGS', data):
        x.append(i.start())
    for j in re.finditer('SUMMARY OF FINDINGS', data):
        y.append(j.start())

    # Append to overview only when the next overview index is after the next summary index
    n = 0
    overview = []
    for m in range(0, len(x)):
        if x[m] == x[-1]:  # condition for last value in x or if only one value in x
            # (Note: len('OVERVIEW OF FINDINGS') is 20; 21 also skips the character after it)
            overview.append(data[x[m] + 21:y[n]])
        elif x[m + 1] > y[n]:
            overview.append(data[x[m] + 21:y[n]])
            if y[-1] == y[n]:
                break
            else:
                n += 1

CodePudding user response:

Regular expressions aren't necessary here; just split the string on the substrings you're looking for.

start = 'OVERVIEW OF FINDINGS'
end = 'SUMMARY OF FINDINGS'
result = text.split(start)[-1].split(end)[0].strip()
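
For reference, here is a self-contained sketch of the same idea, run against the sample text from the question. It swaps `split` for `str.rpartition` and `str.partition`, which cut only at the last start marker and the first end marker after it. The function name `last_section` and the choice to return `None` when a marker is missing are assumptions for illustration, not part of the answer above.

```python
def last_section(text, start='OVERVIEW OF FINDINGS', end='SUMMARY OF FINDINGS'):
    """Return the text between the last `start` and the first `end` after it."""
    # rpartition cuts at the LAST occurrence of `start`;
    # the separator comes back empty if `start` is absent.
    _, found_start, tail = text.rpartition(start)
    if not found_start:
        return None
    # partition cuts at the FIRST occurrence of `end` in what remains.
    body, found_end, _ = tail.partition(end)
    if not found_end:
        return None
    return body.strip()


sample = """OVERVIEW OF FINDINGS

--datatext1

OVERVIEW OF FINDINGS

--datatext2

OVERVIEW OF FINDINGS

--datatext3

SUMMARY OF FINDINGS"""

print(last_section(sample))  # --datatext3
```

Unlike the one-liner, this version signals a missing marker instead of silently returning the whole input.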

CodePudding user response:

If you're open to using regular expressions, we can use re.findall here:

import re

inp = """OVERVIEW OF FINDINGS

--datatext1

OVERVIEW OF FINDINGS

--datatext2

OVERVIEW OF FINDINGS

--datatext3

SUMMARY OF FINDINGS"""

text = re.findall(r'\bOVERVIEW OF FINDINGS\b(?!.*\bOVERVIEW OF FINDINGS\b)\s*(\S+)\s+SUMMARY OF FINDINGS', inp, flags=re.S)[0]
print(text)  # --datatext3

The regex pattern uses a negative lookahead to assert that OVERVIEW OF FINDINGS only matches if it is the last one in the entire text. Here is an explanation of the regex pattern:

\bOVERVIEW OF FINDINGS\b        match 'OVERVIEW...'
(?!.*\bOVERVIEW OF FINDINGS\b)  assert that no more 'OVERVIEW...' occurs
\s*                             optional whitespace
(\S+)                           match content
\s+                             match whitespace
SUMMARY OF FINDINGS             match 'SUMMARY...'
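
One caveat: `(\S+)` captures a single run of non-whitespace characters. That fits the sample, but not a datatext3 that contains spaces or several lines. A possible variation, assuming the content may span multiple lines, is to replace the capture group with a lazy `(.*?)`. The names `PATTERN` and `extract_last` are hypothetical, and this is a sketch rather than a drop-in replacement.

```python
import re

PATTERN = re.compile(
    r'OVERVIEW OF FINDINGS'         # a start marker...
    r'(?!.*OVERVIEW OF FINDINGS)'   # ...with no later start marker after it
    r'\s*(.*?)\s*'                  # lazily capture the content, trimming whitespace
    r'SUMMARY OF FINDINGS',         # up to the first end marker
    flags=re.S,                     # let '.' match newlines too
)


def extract_last(text):
    """Return the content after the last start marker, or None if no match."""
    m = PATTERN.search(text)
    return m.group(1) if m else None


multi = """OVERVIEW OF FINDINGS

--datatext1

OVERVIEW OF FINDINGS

--datatext3 line one
line two

SUMMARY OF FINDINGS"""

print(extract_last(multi))
```

Because the group is lazy and is followed by `\s*`, the captured text stops at the last non-whitespace character before SUMMARY OF FINDINGS.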