list of maximal common connected substrings of two strings-CodePudding

Suppose we have two strings "abcdefgh" and "abudesh". I want for solution to be a list ["ab", "de", "h"]. So, I want a list of maximally connected substrings that are the same for both strings. Is this has a name and what would be a good approach in solving it?

Edit: I need to say that the order is not important in the way that if we have, for example, two strings "abcdefg" and "defkabc", the result is ["abc", "def"].

CodePudding user response：

Using:

Biopython's pairwise2 to align the two sequences;
itertools.groupby to group the "maximally connected substrings".

from Bio import pairwise2
from itertools import groupby

def maxConnectedSubstrings(strA, strB):
    alignment = pairwise2.align.globalxx(strA, strB)[0]
    grouped = groupby(zip(alignment.seqA, alignment.seqB), key=lambda p: p[0] == p[1])
    return [''.join(ca for ca,cb in g) for k,g in grouped if k]

print( maxConnectedSubstrings('abcdefgh', 'abudesh') )
# ['ab', 'de', 'h']

Explanation

First, we align the sequences. The result of alignment = pairwise2.align.globalxx(strA, strB)[0] is:

alignment.seqA = 'abcdefgh'
alignment.seqB = 'abude-sh'

The alignment algorithm found the best way to add '-' in the sequences to align them.

Then, we use groupby on zip(alignment.seqA, alignment.seqB). The zip(...) is a sequence of pairs (character from seqA, character from seqB). We group these pairs with the key lambda p: p[0] == p[1], which gives the following result:

grouped = groupby(zip(alignment.seqA, alignment.seqB), key=lambda p: p[0] == p[1])

grouped = [
    (True,  [('a', 'a'),
             ('b', 'b')]),
    (False, [('c', 'u')]),
    (True,  [('d', 'd'),
             ('e', 'e')]),
    (False, [('f', '-'),
             ('g', 's')]),
    (True,  [('h', 'h')])
]

Finally, we discard the False groups, and we join the letters of every True group.

CodePudding user response：

I can offer the solution of your problem, created in C .

#include <iostream>
#include <vector>
#include <string>
using namespace std;
void showContentVectorString(vector<string>& input)
{
    for(int i=0; i<input.size();   i)
    {
        cout<<input[i]<<", ";
    }
    return;
}
void findEqualParts(string string0, string string1)
{
    string temporary;
    vector<string> strings;
    for(int i=0; i<string0.size();   i)
    {
        for(int j=0; j<string1.size();   j)
        {
            if(string0[i]==string1[j])
            {
                temporary.push_back(string0[i]);
                  i;
                if(j==string1.size()-1)
                {
                    strings.push_back(temporary);
                    temporary.clear();
                    --i;
                }
            }
            else
            {
                if(temporary.empty()==false)
                {
                    strings.push_back(temporary);
                }
                temporary.clear();
            }
        }
    }
    cout<<"strings <- ";
    showContentVectorString(strings);
    return;
}
void solve()
{
    findEqualParts("abcdefgh", "abudesh");
    cout<<endl;
    findEqualParts("abcdefg", "defkabc");
    cout<<endl;
    return;
}
int main() 
{
    solve();
    return 0;
}

Here is the result:

strings <- ab, de, h, 
strings <- abc, def,

If you will need an explanation of the solution, write the corresponding comment.