Home > OS >  Efficient way to find a substring in a string, but the superstring contains brackets
Efficient way to find a substring in a string, but the superstring contains brackets

Time:10-20

I am doing constituency parsing. I have a larger string, and this string contains brackets.

Example: "The [quick brown fox] [jumps over] [the lazy dog]."

I have a smaller substring, but this string does not contain brackets:

"quick brown fox jumps over"

I want to find the version of the smaller substring in the larger substrings, but with the brackets included:

i.e return "[quick brown fox][jumps over]"

This is my implementation, but it gets very slow when the substring is entire sentences and the larger string is entire passages (i.e 100,000s of chars long).

def get_span(smaller_sentence, larger_sentence):
    for start in range(len(larger_sentence)):
        subsentence= larger_sentence[start:]
        without_brackets=remove_brackets(subsentence)
        if(without_brackets.startswith(smaller_sentence)):
            for end in range(start, len(larger_sentence) 1):
                without_brackets=remove_brackets(larger_sentence[start:end])
                if(without_brackets==smaller_sentence):
                    return larger_sentence[start:end]
    raise ValueError(f"Span not found {smaller_sentence, larger_sentence}")

is there a more efficient way to do things?

CodePudding user response:

One approach that can achieve this in linear time complexity would be to create lists of indices at which left brackets and right brackets should be inserted in the larger sentence that's stripped of the brackets.

Then, find the indices of the leftmost left bracket and of the rightmost right bracket within where the smaller sentence is found in the larger sentence (use the bisect module to be find them more efficiently).

Once you have have the indices of the brackets, slice the stripped larger sentence according to the indices and alternately insert left and right brackets to assemble the final output:

from bisect import bisect_left, bisect_right
from itertools import chain, pairwise

def get_span(smaller_sentence, larger_sentence):
    stripped = larger_sentence.translate(str.maketrans('', '', '[]'))
    if (left := stripped.find(smaller_sentence)) == -1:
        raise ValueError(f"Span not found {smaller_sentence, larger_sentence}")
    right = left   len(smaller_sentence)
    count = last_index = 0
    lefts, rights = [], []
    while True:
        for bracket, indices in zip('[]', (lefts, rights)):
            index = larger_sentence.find(bracket, last_index)
            if index == -1:
                break
            indices.append(index - count)
            count  = 1
            last_index = index   1
        else:
            continue
        break
    start, end = bisect_left(lefts, left), bisect_right(rights, right)
    return ''.join(
        fragment for (fragment_start, fragment_end), bracket in zip(
            pairwise((
                left,
                *chain.from_iterable(zip(lefts[start:end], rights[start:end])),
                right
            )),
            (*'[]' * (end - start), '')
        )
        for fragment in (stripped[fragment_start:fragment_end], bracket)
    )

so that:

get_span(
    "quick brown fox jumps over",
    "The [quick brown fox] [jumps over] [the lazy dog]."
)

returns:

[quick brown fox] [jumps over]

Demo: https://www.mycompiler.io/view/7ILLCtFKWNG

  • Related