sample file:
Column header 95: A|T|E|A|A|Y|E|A|E|A
Column header 96: W|I|Q|Q|A|L|P|K|E|A
Column header 97: S|D|F|Q|G|Y|E|A|E|A
I would like to calculate the percentage amino acid composition for each column of a CSV file. I can calculate it only for the first column; I am unable to iterate over the remaining columns and print the percentages for all of them.
import csv
with open('test.csv', 'r') as f:
    reader = csv.reader(f)
    column = [row[0] for row in reader]
amino_acids = {}
for aa in column:
    if aa in amino_acids:
        amino_acids[aa] += 1
    else:
        amino_acids[aa] = 1
for aa, count in amino_acids.items():
    #print(f'{aa}: {count}')
    percentage = count / len(column) * 100
    print(f"{aa}: {percentage: .2f}%")
Expected output:
column header 95:
A=50%
E=30% and so on
similarly for the remaining columns.
Please suggest
CodePudding user response:
Your input format is not clear, but you can apply the following code to each row:
Code:
ls = 'A|T|E|A|A|Y|E|A|E|A'.split('|')
['{}={}%'.format(i, ls.count(i)/len(ls)*100) for i in set(ls)]
Output:
['T=10.0%', 'A=50.0%', 'E=30.0%', 'Y=10.0%']
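Because `set(ls)` has no guaranteed iteration order, the order of the entries above can differ between runs. Here is a minimal sketch of the same idea using `collections.Counter`, with the residues sorted so the result is stable. The function name `composition` and its `sep` parameter are made up for illustration.

```python
from collections import Counter

def composition(row, sep='|'):
    """Return 'X=NN.N%' strings for one delimited row, sorted by residue."""
    residues = row.split(sep)
    counts = Counter(residues)  # residue -> number of occurrences
    total = len(residues)
    # Multiply before dividing so the percentages stay exact for small counts
    return [f'{aa}={100 * counts[aa] / total:.1f}%' for aa in sorted(counts)]

print(composition('A|T|E|A|A|Y|E|A|E|A'))
# ['A=50.0%', 'E=30.0%', 'T=10.0%', 'Y=10.0%']
```

Counting once with `Counter` also avoids calling `ls.count(i)` for every distinct residue, which rescans the whole list each time.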
CodePudding user response:
Process the file with a basic Python file read, since it is not really a CSV file.
Code
with open('test.csv', 'r') as f:
    for line in f:
        line = line.rstrip().split(':')  # remove trailing '\n' and split on ':'
        column_info, sequence = line  # separate into column info and amino acid sequence
        sequence = sequence.strip().split('|')  # remove leading & trailing whitespace and split on '|'
        amino_acids = {}  # Get count of each amino acid
        for aa in sequence:
            amino_acids[aa] = amino_acids.get(aa, 0) + 1
        total = sum(count for count in amino_acids.values())  # total of all counts
        # sort count by amino acids (not necessary, but better for displaying)
        amino_acids = dict(sorted(amino_acids.items(), key = lambda kv: kv[0]))
        print(column_info)
        # Output percentages
        for aa, count in amino_acids.items():
            percentage = count / total *100
            print (f"{aa}={percentage: .2f}%")
Output
Column header 95
A= 50.00%
E= 30.00%
T= 10.00%
Y= 10.00%
Column header 96
A= 20.00%
E= 10.00%
I= 10.00%
K= 10.00%
L= 10.00%
P= 10.00%
Q= 20.00%
W= 10.00%
Column header 97
A= 20.00%
D= 10.00%
E= 20.00%
F= 10.00%
G= 10.00%
Q= 10.00%
S= 10.00%
Y= 10.00%
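The loop above unpacks every line into exactly two parts, so a blank line or a line without a ':' would raise a ValueError. Here is a rough sketch of a more tolerant variant, assuming such lines should simply be skipped. The function name `column_compositions` is hypothetical, and it returns the percentages as a dictionary rather than printing them directly.

```python
from collections import Counter

def column_compositions(lines):
    """Map each header to {amino acid: percentage} for 'header: A|T|...' lines."""
    result = {}
    for line in lines:
        header, sep, sequence = line.strip().partition(':')
        if not sep or not sequence.strip():
            continue  # assumption: skip blank lines and lines without a ':'
        residues = [aa.strip() for aa in sequence.split('|') if aa.strip()]
        counts = Counter(residues)
        total = len(residues)
        # Sort by residue so the output order is stable
        result[header.strip()] = {aa: 100 * n / total for aa, n in sorted(counts.items())}
    return result

sample = [
    'Column header 95: A|T|E|A|A|Y|E|A|E|A',
    '',
    'Column header 96: W|I|Q|Q|A|L|P|K|E|A',
]
for header, percentages in column_compositions(sample).items():
    print(header)
    for aa, pct in percentages.items():
        print(f'{aa}={pct:.2f}%')
```

Any iterable of lines works, so an open file object can be passed in place of the `sample` list.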