Python: Sampling indices from a list with constraints-CodePudding

I want to know how to sample indices from a list with some constraints like below:

               1111
     01234567890123
a = "ABAAABABBBABAB"
b = "ABABBB"

The problem is how to find the ordered sampled characters of a that are equal to b

so, the answer of problem should be like below,

[0, 1, 2, 5, 7, 8],
[0, 1, 2, 5, 7, 9],
[0, 1, 2, 5, 7, 11],
...,
[2, 5, 6, 7, 8, 9],
...

Because b is equal to a[0] a[1] a[2] a[5] a[7] a[8]. So [0, 1, 2, 5, 7, 8] is one answer of the problem. But the indices of answer should be satisfied with ordered conditions. It can not be [2, 1, 3, 5, 7, 8]. a[2] a[1] a[3] a[5] a[7] a[8] is equal to b, but it can not be satisfied with the ordered state condition.

the length of a is 22 and the length of b is always smaller than length of a.

I implemented brute force to solve this one, but It takes large time complexity.

Is there better algorithm than brute force? If there is, how should I implement it?

CodePudding user response：

If I understood correctly this is a global sequence alignment problem, which you can solve using BioPython:

from Bio import pairwise2

a = "ABAAABABBBABAB"
b = "ABABBB"

for alignment in pairwise2.align.globalxx(a, b):
    positions = [i for i, v in enumerate(alignment.seqB) if v != "-"]
    print(positions)

Output (partial)

[0, 1, 6, 7, 8, 13]
[0, 1, 4, 7, 8, 13]
[0, 1, 3, 7, 8, 13]
[0, 1, 2, 7, 8, 13]
[0, 1, 4, 5, 8, 13]
[0, 1, 3, 5, 8, 13]
[0, 1, 2, 5, 8, 13]
[0, 1, 4, 5, 7, 13]
[0, 1, 3, 5, 7, 13]
[0, 1, 2, 5, 7, 13]

CodePudding user response：

Cannot be done in constant time, but can be done with roughly linear time complexity (at least, with respect to A - I think the upper bound of complexity of this is O(A * B)) as follows:

def indices(A, B):
    # store partial patterns and complete patterns as we go
    # partial patterns will be stored in a list, where each element
    # is a 2-tuple (next_char_of_B_to_look_for, current_index_list)
    partial_patterns = []
    complete_patterns = []
    for (idx, char) in enumerate(A):
        # look for the next element, for any partial patterns we have
        # if we find it, then mark the index and increment the char to look for
        for p in partial_patterns:
            if p[0] < len(B) and char == B[p[0]]:
                p[1].append(idx)
                p[0]  = 1
                # p[0] == len(B) indicates a completed pattern, and also 
                # will stop this `if` from being triggered again for this 
                # now-completed pattern (which will be ignored hereafter)
                if p[0] == len(B):
                    complete_patterns.append(p[1])
        # always look for the first element of B to start a new partial pattern
        if char == B[0]:
            partial_patterns.append([1, [idx]])
    return complete_patterns

a = "ABAAABABBBABAB"
b = "ABABBB"
indices(a, b)
# [[0, 1, 2, 5, 7, 8], 
#  [2, 5, 6, 7, 8, 9], 
#  [3, 5, 6, 7, 8, 9], 
#  [4, 5, 6, 7, 8, 9]]

I believe this is the correct output for the given input (as index 6 is the last A with at least three Bs following it)

CodePudding user response：

Build a table similar to that of Longest common subsequence problem, then extract all LCS's using section Reading out all LCSs (here LCS is B sequence)

CodePudding user response：

Simple solution would be:

a = "ABAAABABBBABAB"
b = "ABABBB"

total_res = []
def find_indices(index):
    b_index = 0
    res = []
    for c in range(index, len(a)):
        if b_index > len(b) - 1:
            break
        if a[c] == b[b_index]:
            res.append(c)
            b_index  = 1
    if res not in total_res and res:
        print(res)
        total_res.append(res)
for c in range(len(a)):
    find_indices(c)

Output:

[0, 1, 2, 5, 7, 8]
[2, 5, 6, 7, 8, 9]
[3, 5, 6, 7, 8, 9]
[4, 5, 6, 7, 8, 9]
[6, 7, 10, 11, 13]
[10, 11, 12, 13]
[12, 13]