Suppose we have two strings "abcdefgh" and "abudesh". I want for solution to be a list ["ab", "de", "h"]. So, I want a list of maximally connected substrings that are the same for both strings. Is this has a name and what would be a good approach in solving it?
Edit: I need to say that the order is not important in the way that if we have, for example, two strings "abcdefg" and "defkabc", the result is ["abc", "def"].
CodePudding user response:
Using:
- Biopython's pairwise2 to align the two sequences;
- itertools.groupby to group the "maximally connected substrings".
from Bio import pairwise2
from itertools import groupby
def maxConnectedSubstrings(strA, strB):
alignment = pairwise2.align.globalxx(strA, strB)[0]
grouped = groupby(zip(alignment.seqA, alignment.seqB), key=lambda p: p[0] == p[1])
return [''.join(ca for ca,cb in g) for k,g in grouped if k]
print( maxConnectedSubstrings('abcdefgh', 'abudesh') )
# ['ab', 'de', 'h']
Explanation
First, we align the sequences. The result of alignment = pairwise2.align.globalxx(strA, strB)[0]
is:
alignment.seqA = 'abcdefgh'
alignment.seqB = 'abude-sh'
The alignment algorithm found the best way to add '-'
in the sequences to align them.
Then, we use groupby
on zip(alignment.seqA, alignment.seqB)
. The zip(...)
is a sequence of pairs (character from seqA, character from seqB). We group these pairs with the key lambda p: p[0] == p[1]
, which gives the following result:
grouped = groupby(zip(alignment.seqA, alignment.seqB), key=lambda p: p[0] == p[1])
grouped = [
(True, [('a', 'a'),
('b', 'b')]),
(False, [('c', 'u')]),
(True, [('d', 'd'),
('e', 'e')]),
(False, [('f', '-'),
('g', 's')]),
(True, [('h', 'h')])
]
Finally, we discard the False groups, and we join the letters of every True group.
CodePudding user response:
I can offer the solution of your problem, created in C .
#include <iostream>
#include <vector>
#include <string>
using namespace std;
void showContentVectorString(vector<string>& input)
{
for(int i=0; i<input.size(); i)
{
cout<<input[i]<<", ";
}
return;
}
void findEqualParts(string string0, string string1)
{
string temporary;
vector<string> strings;
for(int i=0; i<string0.size(); i)
{
for(int j=0; j<string1.size(); j)
{
if(string0[i]==string1[j])
{
temporary.push_back(string0[i]);
i;
if(j==string1.size()-1)
{
strings.push_back(temporary);
temporary.clear();
--i;
}
}
else
{
if(temporary.empty()==false)
{
strings.push_back(temporary);
}
temporary.clear();
}
}
}
cout<<"strings <- ";
showContentVectorString(strings);
return;
}
void solve()
{
findEqualParts("abcdefgh", "abudesh");
cout<<endl;
findEqualParts("abcdefg", "defkabc");
cout<<endl;
return;
}
int main()
{
solve();
return 0;
}
Here is the result:
strings <- ab, de, h,
strings <- abc, def,
If you will need an explanation of the solution, write the corresponding comment.