I'm doing a course in bioinformatics. We were supposed to create a function that takes a list of strings like this:
Motifs =[
"AACGTA",
"CCCGTT",
"CACCTT",
"GGATTA",
"TTCCGG"]
and turns it into a count matrix that counts the occurrences of the nucleotides (the letters A, C, G and T) in each column and adds a pseudocount of 1 to each, represented by a dictionary with a list of values for each key, like this:
count ={
'A': [2, 3, 2, 1, 1, 3],
'C': [3, 2, 5, 3, 1, 1],
'G': [2, 2, 1, 3, 2, 2],
'T': [2, 2, 1, 2, 5, 3]}
For example, A occurs 1 + 1 pseudocount = 2 times in the first column. C appears 2 + 1 pseudocount = 3 times in the fourth column.
Here is my solution:
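def CountWithPseudocounts(Motifs):
    t = len(Motifs)      # number of motifs
    k = len(Motifs[0])   # length of each motif
    count = {}
    # start every position at 1 (the pseudocount)
    for symbol in "ACGT":
        count[symbol] = [1 for j in range(k)]
    # add one for each nucleotide observed in each column
    for i in range(t):
        for j in range(k):
            symbol = Motifs[i][j]
            count[symbol][j] += 1
    return count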
The first set of for loops generates a dictionary with the keys A,C,G,T and the initial values 1 for each column like this:
count ={
'A': [1, 1, 1, 1, 1, 1],
'C': [1, 1, 1, 1, 1, 1],
'G': [1, 1, 1, 1, 1, 1],
'T': [1, 1, 1, 1, 1, 1]}
The second set of for loops counts the occurrence of the nucleotides and adds it to the values of the existing dictionary as seen above.
This works and does its job, but I want to know how to further compress both for loops using dict comprehensions.
NOTE: I am fully aware that there are a multitude of modules and libraries like biopython, scipy and numpy that can probably turn the entire function into a one-liner. The problem with modules is that their output format often doesn't match what the automated solution check from the course is expecting.
CodePudding user response:
This
count = {}
for symbol in "ACGT":
    count[symbol] = [1 for j in range(k)]
can be changed to a comprehension as follows:
count = {symbol:[1 for j in range(k)] for symbol in "ACGT"}
and then further simplified by using Python's ability to multiply a list by an integer:
count = {symbol:[1]*k for symbol in "ACGT"}
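One caveat about the `[1]*k` shortcut, sketched below with `k = 6` taken from the motif length in the question: it is safe here because the comprehension evaluates `[1]*k` once per key, so every key gets its own list. A similar-looking alternative, `dict.fromkeys`, would instead reuse a single list object for all keys.

```python
k = 6

# The comprehension evaluates [1] * k once per key, so the four lists are independent.
count = {symbol: [1] * k for symbol in "ACGT"}
count["A"][0] += 1
print(count["A"][0], count["C"][0])  # 2 1

# By contrast, dict.fromkeys stores the SAME list object under every key.
shared = dict.fromkeys("ACGT", [1] * k)
shared["A"][0] += 1
print(shared["A"][0], shared["C"][0])  # 2 2
```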
CodePudding user response:
Compressing the first loop:
count = {symbol: [1 for j in range(k)] for symbol in "ACGT"}
This construct is called a dict comprehension (not a generator): it builds a dict using a for loop.
I'm not sure you can compress the second (nested) loop in the same way, since it isn't generating anything, only modifying the first dict.
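For what it's worth, the nested loop can be folded into a comprehension if the counting is done column by column instead of by updating an existing dict. The following is only a sketch, assuming all motifs have the same length; the snake_case name `count_with_pseudocounts` is made up for illustration, and the course checker may expect the original name `CountWithPseudocounts`.

```python
def count_with_pseudocounts(motifs):
    # zip(*motifs) yields one tuple per column, e.g. ('A', 'C', 'C', 'G', 'T').
    columns = list(zip(*motifs))
    # tuple.count(symbol) counts occurrences in a column; + 1 is the pseudocount.
    return {symbol: [col.count(symbol) + 1 for col in columns] for symbol in "ACGT"}


motifs = ["AACGTA", "CCCGTT", "CACCTT", "GGATTA", "TTCCGG"]
print(count_with_pseudocounts(motifs))
```

This scans every column once per symbol (four passes instead of one), which is negligible for motifs of this size but is a trade-off against the explicit loop.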
CodePudding user response:
You can compress your code a lot using collections.Counter and collections.defaultdict:
from collections import Counter, defaultdict

out = defaultdict(list)
bases = 'ACGT'
for m in zip(*Motifs):   # iterate over the columns
    c = Counter(m)
    for b in bases:
        out[b].append(c[b] + 1)   # a missing base counts as 0, then add the pseudocount
dict(out)
output:
{'A': [2, 3, 2, 1, 1, 3],
'C': [3, 2, 5, 3, 1, 1],
'G': [2, 2, 1, 3, 2, 2],
'T': [2, 2, 1, 2, 5, 3]}
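The step doing most of the work here is `zip(*Motifs)`, which walks the motifs column by column. A short illustration using the motifs from the question (the lowercase variable names are just for this sketch):

```python
from collections import Counter

motifs = ["AACGTA", "CCCGTT", "CACCTT", "GGATTA", "TTCCGG"]

# Unpacking the list into zip() groups the i-th characters of every string.
columns = list(zip(*motifs))
print(columns[0])  # ('A', 'C', 'C', 'G', 'T')

# A Counter returns 0 for keys it has not seen, so c[b] + 1 never raises KeyError.
third = Counter(columns[2])  # third column is C, C, C, A, C
print(third["C"] + 1, third["G"] + 1)  # 5 1
```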
CodePudding user response:
You can use collections.Counter:
from collections import Counter
m = ['AACGTA', 'CCCGTT', 'CACCTT', 'GGATTA', 'TTCCGG']
d = [Counter(i) for i in zip(*m)]
r = {a: [j.get(a, 0) + 1 for j in d] for a in 'ACGT'}
Output:
{'A': [2, 3, 2, 1, 1, 3], 'C': [3, 2, 5, 3, 1, 1], 'G': [2, 2, 1, 3, 2, 2], 'T': [2, 2, 1, 2, 5, 3]}