How can I get a substring from a string in python using a subset of string?

Time:04-23

Let's say that I have this string:

a = 'ashfafhkiojojojhohkhgiobbboddbbgoifbafjgibibfoobfbobobfbafnongokhofgoon'

My goal is to create a function that returns every substring that starts with 'af' and ends with 'kh'. In this example, I would get two substrings:

  • 'afhkiojojojhohkh' and 'afjgibibfoobfbobobfbafnongokh'

I would also like to get the length of these substrings and their location within the larger string.

I have thought about using a for loop but I did not get very far. Any help is very much appreciated.

Thanks.

CodePudding user response:

Using the built-in module re for regular expressions:

import re

text = 'ashfafhkiojojojhohkhgiobbboddbbgoifbafjgibibfoobfbobobfbafnongokhofgoon'

# tuples of the form (substr, (start, end), length)
# int.__rsub__(start, end) evaluates to end - start, i.e. the match length
matches = [(match.group(0), match.span(), int.__rsub__(*match.span()),) for match in re.finditer(r'(af.*?kh)', text)]

longest = max(matches, key=lambda pairs: pairs[-1])

print(matches)
print(longest)

EDIT

If the assignment expression operator := is supported (Python 3.8+), the terms in the list comprehension can be simplified like this:

(match.group(0), pos:=match.span(), int.__rsub__(*pos)) 
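
For reference, the same regex approach can be wrapped in a small reusable function. This is only a sketch: the name find_between and its parameters are made up for illustration, and re.escape is used on the assumption that the start and end markers should be treated as literal text rather than as patterns. The length is written as end minus start, which is the value int.__rsub__(*match.span()) produces.

```python
import re

def find_between(text, start, end):
    """Return (substring, (start_index, end_index), length) for each shortest,
    non-overlapping match that begins with `start` and finishes with `end`.

    Hypothetical helper, not part of the original answer.
    """
    # re.escape makes the markers literal; .*? is non-greedy, so each match
    # stops at the first `end` that follows its `start`.
    pattern = re.escape(start) + r'.*?' + re.escape(end)
    return [(m.group(0), m.span(), m.end() - m.start())
            for m in re.finditer(pattern, text)]

text = 'ashfafhkiojojojhohkhgiobbboddbbgoifbafjgibibfoobfbobobfbafnongokhofgoon'
for substring, span, length in find_between(text, 'af', 'kh'):
    print(substring, span, length)
```

For the string from the question this prints the two expected substrings, with spans (4, 20) and (36, 65) and lengths 16 and 29. Note that . does not match newlines unless re.DOTALL is passed, so a multi-line input would need that flag.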

CodePudding user response:

You can use nested searches: an outer loop looks for each start, and an inner loop looks for every end that follows it.

A full function with configurable start and end values would look like:

def find(inp, start, end):
    ls = len(start)
    le = len(end)
    start_and_len = []

    for i in range(len(inp)-ls+1):
        if inp[i:i+ls] == start:
            for j in range(i, len(inp)-le+1):
                if inp[j:j+le] == end:
                    # (str, start index, len)
                    start_and_len.append((inp[i:j+le], i, j+le-i,))

    return start_and_len
# Use as
>>> a = 'afafaf---khkhkh'
>>> find(a, 'af', 'kh')
[('afafaf---kh', 0, 11),
 ('afafaf---khkh', 0, 13),
 ('afafaf---khkhkh', 0, 15),
 ('afaf---kh', 2, 9),
 ('afaf---khkh', 2, 11),
 ('afaf---khkhkh', 2, 13),
 ('af---kh', 4, 7),
 ('af---khkh', 4, 9),
 ('af---khkhkh', 4, 11)]

# Your given example, with more matches
>>> a = 'ashfafhkiojojojhohkhgiobbboddbbgoifbafjgibibfoobfbobobfbafnongokhofgoon'
>>> find(a, 'af', 'kh')
[('afhkiojojojhohkh', 4, 16),
 ('afhkiojojojhohkhgiobbboddbbgoifbafjgibibfoobfbobobfbafnongokh', 4, 61),
 ('afjgibibfoobfbobobfbafnongokh', 36, 29),
 ('afnongokh', 56, 9)]
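
The function above returns every start/end combination. If only the two substrings from the question are wanted, one possible variation is to pair each start with the nearest following end and then continue scanning after that match. The sketch below uses str.find for this; find_shortest is a hypothetical name, and unlike find it assumes the end marker must begin after the start marker has finished.

```python
def find_shortest(inp, start, end):
    """Return (substring, start index, length) tuples, pairing each `start`
    with the nearest following `end` and resuming the scan after that match.

    Hypothetical variation, not part of the original answer.
    """
    results = []
    pos = 0
    while True:
        i = inp.find(start, pos)          # next start marker at or after pos
        if i == -1:
            break
        j = inp.find(end, i + len(start)) # nearest end marker after the start
        if j == -1:
            break
        stop = j + len(end)
        results.append((inp[i:stop], i, stop - i))
        pos = stop                        # skip past this match (no overlaps)
    return results

a = 'ashfafhkiojojojhohkhgiobbboddbbgoifbafjgibibfoobfbobobfbafnongokhofgoon'
print(find_shortest(a, 'af', 'kh'))
```

On the string from the question this yields only ('afhkiojojojhohkh', 4, 16) and ('afjgibibfoobfbobobfbafnongokh', 36, 29), which matches what the regex answer finds.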