I want to know how to sample indices from a list with some constraints like below:
1111
01234567890123
a = "ABAAABABBBABAB"
b = "ABABBB"
The problem is how to find the ordered sampled characters of a
that are equal to b
so, the answer of problem should be like below,
[0, 1, 2, 5, 7, 8],
[0, 1, 2, 5, 7, 9],
[0, 1, 2, 5, 7, 11],
...,
[2, 5, 6, 7, 8, 9],
...
Because b
is equal to a[0] a[1] a[2] a[5] a[7] a[8]
.
So [0, 1, 2, 5, 7, 8]
is one answer of the problem.
But the indices of answer should be satisfied with ordered conditions.
It can not be [2, 1, 3, 5, 7, 8]
.
a[2] a[1] a[3] a[5] a[7] a[8]
is equal to b
, but it can not be satisfied with the ordered state condition.
the length of a
is 22 and the length of b
is always smaller than length of a
.
I implemented brute force to solve this one, but It takes large time complexity.
Is there better algorithm than brute force? If there is, how should I implement it?
CodePudding user response:
If I understood correctly this is a global sequence alignment problem, which you can solve using BioPython:
from Bio import pairwise2
a = "ABAAABABBBABAB"
b = "ABABBB"
for alignment in pairwise2.align.globalxx(a, b):
positions = [i for i, v in enumerate(alignment.seqB) if v != "-"]
print(positions)
Output (partial)
[0, 1, 6, 7, 8, 13]
[0, 1, 4, 7, 8, 13]
[0, 1, 3, 7, 8, 13]
[0, 1, 2, 7, 8, 13]
[0, 1, 4, 5, 8, 13]
[0, 1, 3, 5, 8, 13]
[0, 1, 2, 5, 8, 13]
[0, 1, 4, 5, 7, 13]
[0, 1, 3, 5, 7, 13]
[0, 1, 2, 5, 7, 13]
CodePudding user response:
Cannot be done in constant time, but can be done with roughly linear time complexity (at least, with respect to A
- I think the upper bound of complexity of this is O(A * B)
) as follows:
def indices(A, B):
# store partial patterns and complete patterns as we go
# partial patterns will be stored in a list, where each element
# is a 2-tuple (next_char_of_B_to_look_for, current_index_list)
partial_patterns = []
complete_patterns = []
for (idx, char) in enumerate(A):
# look for the next element, for any partial patterns we have
# if we find it, then mark the index and increment the char to look for
for p in partial_patterns:
if p[0] < len(B) and char == B[p[0]]:
p[1].append(idx)
p[0] = 1
# p[0] == len(B) indicates a completed pattern, and also
# will stop this `if` from being triggered again for this
# now-completed pattern (which will be ignored hereafter)
if p[0] == len(B):
complete_patterns.append(p[1])
# always look for the first element of B to start a new partial pattern
if char == B[0]:
partial_patterns.append([1, [idx]])
return complete_patterns
a = "ABAAABABBBABAB"
b = "ABABBB"
indices(a, b)
# [[0, 1, 2, 5, 7, 8],
# [2, 5, 6, 7, 8, 9],
# [3, 5, 6, 7, 8, 9],
# [4, 5, 6, 7, 8, 9]]
I believe this is the correct output for the given input (as index 6 is the last A
with at least three B
s following it)
CodePudding user response:
Build a table similar to that of Longest common subsequence problem, then extract all LCS's using section Reading out all LCSs (here LCS is B sequence)
CodePudding user response:
Simple solution would be:
a = "ABAAABABBBABAB"
b = "ABABBB"
total_res = []
def find_indices(index):
b_index = 0
res = []
for c in range(index, len(a)):
if b_index > len(b) - 1:
break
if a[c] == b[b_index]:
res.append(c)
b_index = 1
if res not in total_res and res:
print(res)
total_res.append(res)
for c in range(len(a)):
find_indices(c)
Output:
[0, 1, 2, 5, 7, 8]
[2, 5, 6, 7, 8, 9]
[3, 5, 6, 7, 8, 9]
[4, 5, 6, 7, 8, 9]
[6, 7, 10, 11, 13]
[10, 11, 12, 13]
[12, 13]