If list in a column of Pandas DataFrame

I am trying to verify if the elements of a list are contained in a DataFrame (DF) in Pandas.

This is the code that I have so far:

import pandas as pd
from pathlib import Path

data = pd.read_excel(r'/home/darteagam/diploma/bert/files/codon_positions.xlsx')
df = pd.DataFrame(data,columns=['position','codon','aminoacid'])
print("DataFrame Loaded!")
#print(df)

# reading the files

with open("/home/darteagam/diploma/bert/files/bert_aa_example.txt", "r") as f1, open("/home/darteagam/diploma/bert/files/bert_nn_example.txt", "r") as f2:
    #with open("/home/darteagam/diploma/bert/files/bert_nn_example.txt", "r") as f2:
    print('AA in 31 position:')
    for aa in f1:
        prot_seq = list(aa)
        lp = len(prot_seq)
        position_aa = prot_seq[30:31]
        #print(prot_seq)
        position_aa = list(aa[30:31]) # verifying the 31st position
        print(position_aa)
        #print(len(position_aa))
        #print(aa)
    #print('Nucleotide sequences')
    for nn in f2:
        nuc_seq = nn
        #print(nuc_seq)
        x=3 
        spl=[nuc_seq[y-x:y] for y in range(x, len(nuc_seq)+x, x)]
        pos_cod = spl[30:31]
        list_codons = (list(pos_cod))
        print(list_codons)
        #print(len(list_codons))
        #print(spl)

Output of list:

['ATC']
['AAC']
['ACC']
['TTT']
['GTC']
['CTC']

Output of DF:

         position codon aminoacid
0          1   GCT         A
1          2   GCC         A
2          3   GCA         A
3          4  GCG          A
4          5   CGT         R
..       ...   ...       ...
56        57   TAC         Y
57        58  GTT          V
58        59  GTC          V
59        60  GTA          V
60        61   GTG         V

I'd like to verify whether each codon in the list output is contained in the codon column of the DF and, if so, get the position of that element in the DF.

CodePudding user response：

First of all, the "output" that you've presented is a sequence of separate prints to standard output, each showing a one-element list. It would be better to collect the codons into a single list like ['ATC','AAC','ACC','TTT','GTC','CTC'].

Concretely, I suspect the following change to your second loop would produce such a list.

    # <first for loop>...
    #
    codon_list = []
    for nuc_seq in f2:
        x=3 
        spl=[nuc_seq[y-x:y] for y in range(x, len(nuc_seq)+x, x)]
        
        # pos_cod in your code is a list containing only spl[30]
        codon_list.append(spl[30])
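
One caveat with indexing spl[30] directly: lines read from a file keep their trailing newline, and a sequence with fewer than 31 codons raises an IndexError. Here is a minimal sketch of a more defensive version. The helper name codon_at and the sample sequences are made up for illustration, not taken from your files.

```python
def codon_at(nuc_seq, index=30, size=3):
    """Return the codon at 0-based `index`, or None if the sequence is too short."""
    seq = nuc_seq.strip()  # drop the trailing newline left by file iteration
    start = index * size
    codon = seq[start:start + size]
    # A short or empty slice means the sequence has no complete codon there.
    return codon if len(codon) == size else None


# Hypothetical stand-ins for the lines of the nucleotide file.
lines = ["ATG" * 30 + "GTC" + "TAA\n", "ATGAAA\n"]

# Keep only the sequences that are long enough to have a 31st codon.
codon_list = [c for c in (codon_at(line) for line in lines) if c is not None]
print(codon_list)  # ['GTC']
```

Slicing at index * size avoids building the whole list of codons when only one of them is needed.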

Once you have this list, you could do something like what I've done below. The script below is self-contained, which is to say that it can be run with a simple copy and paste.

import pandas as pd

# generate example dataframe
df = pd.DataFrame({'position': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 56: 57, 57: 58, 58: 59, 59: 60, 60: 61}, 
                   'codon': {0: 'GCT', 1: 'GCC', 2: 'GCA', 3: 'GCG', 4: 'CGT', 56: 'TAC', 57: 'GTT', 58: 'GTC', 59: 'GTA', 60: 'GTG'}, 
                   'aminoacid': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'R', 56: 'Y', 57: 'V', 58: 'V', 59: 'V', 60: 'V'}})

# example list of codons
codon_list = ['TTT','GTC','CTC','GCC']

for c in codon_list:
    where = (df['codon'] == c)
    if where.any():
        pos = df.at[where.idxmax(), 'position']
        print(f"codon {c}: position {pos}")
    else:
        print(f"codon {c} not found")

Example DataFrame that's generated:

    position codon aminoacid
0          1   GCT         A
1          2   GCC         A
2          3   GCA         A
3          4   GCG         A
4          5   CGT         R
56        57   TAC         Y
57        58   GTT         V
58        59   GTC         V
59        60   GTA         V
60        61   GTG         V

Resulting prints from script:

codon TTT not found
codon GTC: position 59
codon CTC not found
codon GCC: position 2

If speed is a concern, I suspect that replacing pos = df.at[where.idxmax(), 'position'] with pos = df.loc[where,'position'].reset_index(drop=True).iat[0] would speed things up.
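
Another option, if the list of codons is long, is to avoid the Python-level loop altogether and build a codon-to-position lookup once. The following is only a sketch on a cut-down example DataFrame. It assumes you want the first matching row for each codon, and the str.strip() call is a guess on my part: the uneven column alignment in your printed DF may mean some codon cells contain stray whitespace, which would make an exact == comparison fail.

```python
import pandas as pd

# Cut-down example data; the real DataFrame comes from the Excel file.
df = pd.DataFrame({
    "position": [1, 2, 3, 59],
    "codon": ["GCT", "GCC ", "GCA", "GTC"],  # note the trailing space in "GCC "
    "aminoacid": ["A", "A", "A", "V"],
})
codon_list = ["TTT", "GTC", "CTC", "GCC"]

# Series indexed by codon, holding the position of each codon's first occurrence.
lookup = (
    df.assign(codon=df["codon"].str.strip())  # assumption: whitespace is noise
      .drop_duplicates("codon")
      .set_index("codon")["position"]
)

# reindex aligns the lookup to codon_list; codons absent from df become NaN.
positions = lookup.reindex(codon_list)
found = positions.dropna().astype(int).to_dict()
missing = positions[positions.isna()].index.tolist()

print(found)    # {'GTC': 59, 'GCC': 2}
print(missing)  # ['TTT', 'CTC']
```

If you only need a yes/no answer per row of the DF rather than positions, df['codon'].isin(codon_list) gives a boolean mask directly.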