is there a way of comparing the similarities between these two sequences in python?-CodePudding

I am new to Python and I need your help in getting the similarity between two sequences. Assuming they are not of the same length and some may have (-) gap symbols.

So here is my code bellow in getting the similarity in only one sequence.

seq1 = "AAAATCCCTAGGGTCAT"
def similarity(seq1):
    base_dic={}
    for i in range(len(seq1)):
        if seq1[i] in base_dic.keys():
           base_dic[seq1[i]] =1
        else:
            base_dic[seq1[i]]=1
    for key in base_dic.keys():
        base_dic[key]=base_dic[key]/len(seq1)*100
    return base_dic
similarity(seq1)

Output: 
{'A': 35.294117647058826,
 'T': 23.52941176470588,
 'C': 23.52941176470588,
 'G': 17.647058823529413}

My question is how could I modify this code, so that it can take two sequences at a time and find the similarities?
for ex. 
seq1 = "AAAATCCCTAGAAAGGTCAT"
seq2 = “AAGATC---TTTCTACT”

Any ideas? Thanks

i am expecting to get the similarity of A, T, G, C but not -. as they should be counted as unsimilar.

CodePudding user response：

There are 2 parts to this answer.

Polishes to the code in question.
Standard algorithm to find how similar 2 strings are.

Part-1: Polishes to your code in the question.
The key takeaways here for you are to make your code more pythonic

from collections import Counter

seq1 = "AAAATCCCTAGGGTCAT"
seq2 = "AAGATC---TTTCTACT"


def get_count(s):
    H = defaultdict(int)
    n = len(s)
    s = [c for c in s if c in "ATCG"]
    ctr = Counter(s)
    ctr = {k: (v / n) * 100 for k, v in ctr.items()}
    return ctr


def calculate_similarity(s1, s2):
    result1 = get_count(seq1)
    result2 = get_count(seq2)
    print(result1)
    print(result2)
    """
    Add some custom logic here to compute
    how similar the strings are (or)
    similarity of A, T, C & G
    """

calculate_similarity(seq1, seq2)

Part-2: Standard algorithm to find the similarity b/w to strings is given by Levenshtein distance.

The Levenshtein distance is a string metric for measuring the difference between two sequences.
Consider a scenario where you can convert sequence A to sequence B by adding (or) deleting (or) replacing a character. For simplicity, let's assume we have to calculate the number of operations to convert sequence A to B.
This can be computed by the following logic.

def minDistance(word1: str, word2: str) -> int:
    m, n = len(word1), len(word2)
    dp = [[0] * (n   1) for _ in range(m   1)]
    dp[0][0] = 0
    for i in range(m   1):
        dp[i][0] = i
    for j in range(n   1):
        dp[0][j] = j

    for i in range(1, m   1):
        for j in range(1, n   1):
            if word1[i - 1] == word2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                # dp[i-1][j] -> insert
                # dp[i][j-1] -> remove
                # dp[i-1][j-1] -> replace
                dp[i][j] = min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])   1
    return dp[-1][-1]

# The number of operations to convert seq1 to seq2
# This number can help you compare how similar 2 strings are.
result = minDistance(seq1, seq2). 

print(result)

References: