If list in a column of Pandas DataFrame

Time:11-14

I am trying to verify if the elements of a list are contained in a DataFrame (DF) in Pandas.

This is the code that I have so far:

import pandas as pd
from pathlib import Path

data = pd.read_excel(r'/home/darteagam/diploma/bert/files/codon_positions.xlsx')
df = pd.DataFrame(data,columns=['position','codon','aminoacid'])
print("DataFrame Loaded!")
#print(df)

# reading the files

with open("/home/darteagam/diploma/bert/files/bert_aa_example.txt", "r") as f1, open("/home/darteagam/diploma/bert/files/bert_nn_example.txt", "r") as f2:
    #with open("/home/darteagam/diploma/bert/files/bert_nn_example.txt", "r") as f2:
    print('AA in 31 position:')
    for aa in f1:
        prot_seq = list(aa)
        lp = len(prot_seq)
        position_aa = prot_seq[30:31]
        #print(prot_seq)
        position_aa = list(aa[30:31]) # amino acid at position 31 (zero-based index 30)
        print(position_aa)
        #print(len(position_aa))
        #print(aa)
    #print('Nucleotide sequences')
    for nn in f2:
        nuc_seq = nn
        #print(nuc_seq)
        x=3 
        spl = [nuc_seq[y-x:y] for y in range(x, len(nuc_seq) + x, x)]
        pos_cod = spl[30:31]
        list_codons = (list(pos_cod))
        print(list_codons)
        #print(len(list_codons))
        #print(spl)

Output of list:

['ATC']
['AAC']
['ACC']
['TTT']
['GTC']
['CTC']

Output of DF:

         position codon aminoacid
0          1   GCT         A
1          2   GCC         A
2          3   GCA         A
3          4  GCG          A
4          5   CGT         R
..       ...   ...       ...
56        57   TAC         Y
57        58  GTT          V
58        59  GTC          V
59        60  GTA          V
60        61   GTG         V

I'd like to check whether each codon in the list output is contained in the codon column of the DF and, if it is, get the position of that element in the DF.

CodePudding user response:

First of all, the "output" you've presented is a sequence of single-element lists printed to standard out. It would be better to collect the codons in a single list like ['ATC','AAC','ACC','TTT','GTC','CTC'].

Concretely, I suspect the following change to your second loop would produce such a list.

    # <first for loop>...
    #
    codon_list = []
    for nuc_seq in f2:
        x=3 
        spl = [nuc_seq[y-x:y] for y in range(x, len(nuc_seq) + x, x)]
        
        # pos_cod in your code is a list containing only spl[30]
        codon_list.append(spl[30]) 
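
For reference, here is a self-contained sketch of that loop. It assumes each line of the nucleotide file holds one sequence; io.StringIO stands in for the real file, and the helper name codon_at and the sample sequences are hypothetical. Note that spl[30] raises an IndexError when a sequence has fewer than 31 codons, so this version skips such sequences instead.

```python
import io

def codon_at(nuc_seq, index=30, size=3):
    """Return the codon at a zero-based index, or None if the sequence is too short."""
    seq = nuc_seq.strip()  # drop the trailing newline read from the file
    start = index * size
    codon = seq[start:start + size]
    return codon if len(codon) == size else None

# Hypothetical stand-in for the nucleotide file: one sequence per line.
f2 = io.StringIO("GCT" * 30 + "ATC" + "GCT" * 5 + "\n"
                 + "GCT" * 30 + "AAC\n"
                 + "GCT" * 10 + "\n")

codon_list = []
for nuc_seq in f2:
    codon = codon_at(nuc_seq)
    if codon is not None:  # the third sample line is too short and is skipped
        codon_list.append(codon)

print(codon_list)  # ['ATC', 'AAC']
```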

Once you have this list, you could do something like what I've done below. The script below is self-contained, which is to say that it can be run with a simple copy and paste.

import pandas as pd

# generate example dataframe
df = pd.DataFrame({'position': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 56: 57, 57: 58, 58: 59, 59: 60, 60: 61}, 
                   'codon': {0: 'GCT', 1: 'GCC', 2: 'GCA', 3: 'GCG', 4: 'CGT', 56: 'TAC', 57: 'GTT', 58: 'GTC', 59: 'GTA', 60: 'GTG'}, 
                   'aminoacid': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'R', 56: 'Y', 57: 'V', 58: 'V', 59: 'V', 60: 'V'}})

# example list of codons
codon_list = ['TTT','GTC','CTC','GCC']

for c in codon_list:
    where = (df['codon'] == c)
    if where.any():
        pos = df.at[where.idxmax(), 'position']
        print(f"codon {c}: position {pos}")
    else:
        print(f"codon {c} not found")

Example DataFrame that's generated:

    position codon aminoacid
0          1   GCT         A
1          2   GCC         A
2          3   GCA         A
3          4   GCG         A
4          5   CGT         R
56        57   TAC         Y
57        58   GTT         V
58        59   GTC         V
59        60   GTA         V
60        61   GTG         V

Resulting prints from script:

codon TTT not found
codon GTC: position 59
codon CTC not found
codon GCC: position 2

If speed is a concern, I suspect that replacing pos = df.at[where.idxmax(), 'position'] with pos = df.loc[where,'position'].reset_index(drop=True).iat[0] would speed things up.
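
If the codon list is long, another option is to drop the Python loop and let pandas do the lookup in one pass. The following is a minimal sketch, not a benchmarked claim: it builds a codon-to-position Series once and maps the whole list through it. The str.strip() call is an assumption, added in case the codon column contains stray whitespace (the misaligned rows in the question's printout hint at that). drop_duplicates keeps the first position for each codon, which matches what idxmax picks in the loop above.

```python
import pandas as pd

# Example DataFrame; 'GTC ' has a trailing space on purpose (assumed data quirk).
df = pd.DataFrame({
    'position': [1, 2, 3, 4, 5, 57, 58, 59, 60, 61],
    'codon': ['GCT', 'GCC', 'GCA', 'GCG', 'CGT', 'TAC', 'GTT', 'GTC ', 'GTA', 'GTG'],
    'aminoacid': ['A', 'A', 'A', 'A', 'R', 'Y', 'V', 'V', 'V', 'V'],
})
codon_list = ['TTT', 'GTC', 'CTC', 'GCC']

# Normalize the column, then build a codon -> first position lookup.
clean = df.assign(codon=df['codon'].str.strip())
lookup = clean.drop_duplicates('codon').set_index('codon')['position']

# Map every codon at once; codons missing from the DataFrame become NaN.
positions = pd.Series(codon_list, index=codon_list).map(lookup)

for codon, pos in positions.items():
    if pd.isna(pos):
        print(f"codon {codon} not found")
    else:
        print(f"codon {codon}: position {int(pos)}")
```

If only the matching rows are needed, clean[clean['codon'].isin(codon_list)] returns them directly.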
