Find every substring between two repeated strings in Python

I have a big string like:

I am Dr. Maynard Rubach, test1 president of Cavalier. I must insist test1 that in common decency we all test2 refrain from personal references. test2 Mr. Civek has done his best to give you an explanation, test1 but of course he is a test2 layman and, while he has many excellent test2 qualities, we cannot expect him to be conversant with test1 the principles of science.

I need to find every substring between test1 and test2, including test1 and test2 themselves.

I used this code, but it gives me only two matches:

pattern = re.compile('test1.*?test2', flags=re.IGNORECASE | re.DOTALL)
for s in re.finditer(pattern, bigstring):
    print(s.group())

output:

test1 but of course he is a test2 layman and, while he has many excellent test2
test1 but of course he is a test2

Explanation: I need to get strings like:

test1 that in common decency we all test2

test1 president of Cavalier. I must insist test1 that in common decency we all test2

test1 but of course he is a test2 layman and, while he has many excellent test2

and so on. Is there another way, even one that does not use regex?

CodePudding user response：

I don't think regex by itself is the right tool for this use case. You are better off finding indices of the matched strings and then slicing.

Below we use regex to find all of the start positions (indices) of 'test1' and 'test2'. We then iterate over the 'test1' indices, skip any 'test2' index that occurs before the current 'test1', and slice the original string to get the substrings.

import re

phrase = (
    "I am Dr. Maynard Rubach, test1 president of Cavalier. I must insist test1 that "
    "in common decency we all test2 refrain from personal references. test2 Mr. Civek "
    "has done his best to give you an explanation, test1 but of course he is a test2 layman "
    "and, while he has many excellent test2 qualities, we cannot expect him to be conversant "
    "with test1 the principles of science."
)

k1 = 'test1'
k2 = 'test2'

indices1 = [m.start() for m in re.finditer(k1, phrase)]
indices2 = [m.start() for m in re.finditer(k2, phrase)]

substrings = []
for ix1 in indices1:
    # only consider 'test2' matches that start after this 'test1'
    for ix2 in filter(lambda x: x > ix1, indices2):
        substrings.append(phrase[ix1:(ix2 + len(k2))])

print(substrings)
# prints:
['test1 president of Cavalier. I must insist test1 that in common decency we all test2', 
 'test1 president of Cavalier. I must insist test1 that in common decency we all test2 refrain from personal references. test2', 
 'test1 president of Cavalier. I must insist test1 that in common decency we all test2 refrain from personal references. test2 Mr. Civek has done his best to give you an explanation, test1 but of course he is a test2', 
 'test1 president of Cavalier. I must insist test1 that in common decency we all test2 refrain from personal references. test2 Mr. Civek has done his best to give you an explanation, test1 but of course he is a test2 layman and, while he has many excellent test2', 
 'test1 that in common decency we all test2', 
 'test1 that in common decency we all test2 refrain from personal references. test2', 
 'test1 that in common decency we all test2 refrain from personal references. test2 Mr. Civek has done his best to give you an explanation, test1 but of course he is a test2', 
 'test1 that in common decency we all test2 refrain from personal references. test2 Mr. Civek has done his best to give you an explanation, test1 but of course he is a test2 layman and, while he has many excellent test2', 
 'test1 but of course he is a test2', 
 'test1 but of course he is a test2 layman and, while he has many excellent test2']

CodePudding user response：

There are more elegant solutions already proposed but here's another approach:

from itertools import product

s = "I am Dr. Maynard Rubach, test1 president of Cavalier. I must insist test1 that in common decency we all test2 refrain from personal references. test2 Mr. Civek has done his best to give you an explanation, test1 but of course he is a test2 layman and, while he has many excellent test2 qualities, we cannot expect him to be conversant with test1 the principles of science."

start = 'test1'
end = 'test2'

def offsets(s, test):
    # collect the index of every occurrence of `test` in `s`
    t = []
    offset = 0
    while (i := s.find(test, offset)) >= 0:
        t.append(i)
        offset = i + 1
    return t

# keep only the pairs where the start token precedes the end token; a plain
# filter is needed because valid and invalid pairs are interleaved in product order
for i, j in product(offsets(s, start), offsets(s, end)):
    if i < j:
        print(s[i:j + len(end)])

The idea is to find the offsets of the start and end tokens ('test1', 'test2'), then output a substring for each pair in the Cartesian product of the two lists where the start index is less than the end index.
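As a compact illustration of the same pairing idea, here is a minimal sketch run on a made-up short string. The names `all_spans` and `sample` are hypothetical and not part of the answer above, and the sketch assumes the tokens are plain literal text.

```python
import re

def all_spans(text, start, end):
    """Every substring that begins at a `start` token and ends at a later `end` token."""
    # re.escape keeps the tokens literal even if they contain regex metacharacters
    starts = [m.start() for m in re.finditer(re.escape(start), text)]
    ends = [m.start() for m in re.finditer(re.escape(end), text)]
    # keep only the (start, end) pairs where the start token comes first
    return [text[i:j + len(end)] for i in starts for j in ends if i < j]

sample = "test1 a test1 b test2 c test2"  # made-up short input
for span in all_spans(sample, "test1", "test2"):
    print(span)
# test1 a test1 b test2
# test1 a test1 b test2 c test2
# test1 b test2
# test1 b test2 c test2
```

With two start tokens and two end tokens that all follow them, the product yields four substrings, one per valid pair.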

CodePudding user response：

If you want to find all substrings that start with "test1" and end with the next "test2", you could do:

import re

# get start and stop of test1/test2 words (text is the input string from the question)
matches = [(m.group(0), m.start(), m.end())
           for m in re.finditer('test1|test2', text)]

# iterate over matches to find the test1 and next test2
out = []
for i, m in enumerate(matches):
    if m[0] == 'test1':
        m2 = next(filter(lambda x: x[0] == 'test2', matches[i + 1:]), None)
        if m2:
            out.append(text[m[1]:m2[2]])

output:

['test1 president of Cavalier. I must insist test1 that in common decency we all test2',
 'test1 that in common decency we all test2',
 'test1 but of course he is a test2']
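If only the shortest match per 'test1' is wanted, a regex lookahead may also do the job, since a zero-width lookahead lets the reported matches overlap. This is a minimal sketch rather than a definitive implementation: `shortest_spans` and `sample` are hypothetical names, and the tokens are assumed to be literal text.

```python
import re

def shortest_spans(text, start, end):
    """For each `start` token, capture the text up to and including the next `end` token."""
    # the lookahead consumes nothing, so the scan can match again at a later,
    # overlapping `start`; the capture group holds the actual substring
    pattern = re.compile(
        "(?=(" + re.escape(start) + ".*?" + re.escape(end) + "))",
        flags=re.DOTALL,
    )
    return pattern.findall(text)

sample = "test1 a test1 b test2 c test2"  # made-up short input
print(shortest_spans(sample, "test1", "test2"))
# ['test1 a test1 b test2', 'test1 b test2']
```

A start token with no end token after it simply produces no match.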

CodePudding user response：

An alternative approach without regular expressions is to define a simple generator function that yields every substring between a given start and end:

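def find_between(source, start, end):
    # start one token-length before 0 so the first search begins at index 0
    start_index = -len(start)
    while (start_index := source.find(start, start_index + len(start))) >= 0:
        # search for end tokens from the current start position onwards
        end_index = start_index - len(end)
        while (end_index := source.find(end, end_index + len(end))) >= 0:
            yield source[start_index: end_index + len(end)]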

Usage:

print(*find_between(bigstring, "test1", "test2"), sep="\n")
