Calculating the percentage of amino acid composition for each column in a CSV file

sample file:

Column header 95: A|T|E|A|A|Y|E|A|E|A
Column header 96: W|I|Q|Q|A|L|P|K|E|A
Column header 97: S|D|F|Q|G|Y|E|A|E|A

I would like to calculate the percentage of amino acid composition for each column of a CSV file. I can calculate it for the first column only; I can't work out how to iterate over the remaining columns and print the percentages for all of them.

import csv
with open ('test.csv', 'r') as f:
    reader = csv.reader(f)
    column = [row[0] for row in reader]
    amino_acids = {}
    for aa in column:
        if aa in amino_acids:
            amino_acids[aa] += 1
        else:
            amino_acids[aa] = 1
    for aa, count in amino_acids.items():
        #print(f'{aa}: {count}')
        percentage = count / len (column) *100
        print (f"{aa}: {percentage: .2f}%")

Expected output:

column header 95:
A=50%
E=30% and so on
similarly for the remaining columns.

Please suggest

CodePudding user response：

It's not clear what format your input is in, but you can apply the following code to each row:

Code:

ls = 'A|T|E|A|A|Y|E|A|E|A'.split('|')
['{}={}%'.format(i, ls.count(i)/len(ls)*100) for i in set(ls)]

Output:

['T=10.0%', 'A=50.0%', 'E=30.0%', 'Y=10.0%']
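
The same idea can be wrapped in a small function so it is easy to call once per row. This is only a sketch: the function name `composition` and its `sep` parameter are made up for illustration, and it swaps `list.count` for `collections.Counter` so each sequence is scanned once. The result is sorted by residue letter, since iterating over a `set` can give a different order from run to run.

```python
from collections import Counter

def composition(sequence, sep="|"):
    """Return {residue: percentage} for one separator-delimited sequence."""
    residues = sequence.split(sep)
    counts = Counter(residues)          # residue -> number of occurrences
    total = len(residues)
    # Sort by residue letter so the result is reproducible
    return {aa: 100 * n / total for aa, n in sorted(counts.items())}

for aa, pct in composition("A|T|E|A|A|Y|E|A|E|A").items():
    print(f"{aa}={pct}%")
```

For the first sample row this prints A=50.0%, E=30.0%, T=10.0% and Y=10.0%, one per line.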

CodePudding user response：

Process the file with a basic Python file read, since the sample is not really a CSV file.

Code

with open('test.csv', 'r') as f:
    for line in f:
        line = line.rstrip().split(':')         # remove trailing newline and split on ':'
        column_info, sequence = line            # separate into column info and amino acid sequence
        sequence = sequence.strip().split('|')  # remove leading & trailing whitespace and split on '|'
        amino_acids = {}                        # Get count of each amino acid
        for aa in sequence:
            amino_acids[aa] = amino_acids.get(aa, 0) + 1
            
        total = sum(amino_acids.values())       # total of all counts
        
        # sort count by amino acids (not necessary, but better for displaying)
        amino_acids = dict(sorted(amino_acids.items(), key = lambda kv: kv[0]))   
                  
        print(column_info)
               
        # Output percentages
        for aa, count in amino_acids.items():
            percentage = count / total *100                            
            print (f"{aa}={percentage: .2f}%")

Output

Column header 95
A= 50.00%
E= 30.00%
T= 10.00%
Y= 10.00%
Column header 96
A= 20.00%
E= 10.00%
I= 10.00%
K= 10.00%
L= 10.00%
P= 10.00%
Q= 20.00%
W= 10.00%
Column header 97
A= 20.00%
D= 10.00%
E= 20.00%
F= 10.00%
G= 10.00%
Q= 10.00%
S= 10.00%
Y= 10.00%
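
If the data really is stored as a CSV with one residue per cell and one alignment position per column (which is how the question describes it, though the sample does not show it), the columns can be obtained by transposing the rows with `zip(*rows)`. The sketch below assumes that layout, with the column names in the first row; the function name `column_composition` and the sample data are invented for illustration.

```python
import csv
import io
from collections import Counter

def column_composition(csv_file):
    """Return {header: {residue: percentage}} for a column-oriented CSV."""
    reader = csv.reader(csv_file)
    header = next(reader)                   # assumed: first row holds column names
    rows = [row for row in reader if row]   # skip blank lines
    result = {}
    # zip(*rows) transposes the rows, so each tuple is one column
    for name, column in zip(header, zip(*rows)):
        counts = Counter(column)
        total = len(column)
        result[name] = {aa: 100 * n / total for aa, n in sorted(counts.items())}
    return result

# Made-up sample in the assumed layout (not the asker's real file)
sample = io.StringIO("95,96\nA,W\nT,I\nA,Q\nA,Q\n")
for name, percentages in column_composition(sample).items():
    print(f"Column header {name}")
    for aa, pct in percentages.items():
        print(f"{aa}={pct:.2f}%")
```

With a real file, pass the object returned by `open('test.csv', newline='')` instead of the `io.StringIO` sample.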