I am trying to check whether the elements of a list are contained in a DataFrame (DF) in Pandas.
This is the code that I have so far:
import pandas as pd
from pathlib import Path
data = pd.read_excel(r'/home/darteagam/diploma/bert/files/codon_positions.xlsx')
df = pd.DataFrame(data, columns=['position', 'codon', 'aminoacid'])
print("DataFrame Loaded!")
#print(df)
# reading the files
with open("/home/darteagam/diploma/bert/files/bert_aa_example.txt", "r") as f1, open("/home/darteagam/diploma/bert/files/bert_nn_example.txt", "r") as f2:
    #with open("/home/darteagam/diploma/bert/files/bert_nn_example.txt", "r") as f2:
    print('AA in 31 position:')
    for aa in f1:
        prot_seq = list(aa)
        lp = len(prot_seq)
        position_aa = prot_seq[30:31]
        #print(prot_seq)
        position_aa = list(aa[30:31])  # verifying the 31 position
        print(position_aa)
        #print(len(position_aa))
        #print(aa)
    #print('Nucleotide sequences')
    for nn in f2:
        nuc_seq = nn
        #print(nuc_seq)
        x = 3
        # split the sequence into consecutive 3-letter codons
        spl = [nuc_seq[y-x:y] for y in range(x, len(nuc_seq) + x, x)]
        pos_cod = spl[30:31]
        list_codons = (list(pos_cod))
        print(list_codons)
        #print(len(list_codons))
        #print(spl)
Output of list:
['ATC']
['AAC']
['ACC']
['TTT']
['GTC']
['CTC']
Output of DF:
    position codon aminoacid
0          1   GCT         A
1          2   GCC         A
2          3   GCA         A
3          4   GCG         A
4          5   CGT         R
..       ...   ...       ...
56        57   TAC         Y
57        58   GTT         V
58        59   GTC         V
59        60   GTA         V
60        61   GTG         V
I'd like to check whether each codon in the output list is contained in the codon column of the DF and, if it is, get the position of that element in the DF.
CodePudding user response:
First of all, the "output" that you've presented seems to be a sequence of prints to standard out. It would be ideal to have a single list like ['ATC','AAC','ACC','TTT','GTC','CTC'].
Concretely, I suspect the following change to your second loop would produce such a list.
# <first for loop>...
#
codon_list = []
for nuc_seq in f2:
    x = 3
    spl = [nuc_seq[y-x:y] for y in range(x, len(nuc_seq) + x, x)]
    # pos_cod in your code is a list containing only spl[30]
    codon_list.append(spl[30])
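Note that spl[30] raises an IndexError if a line holds fewer than 31 codons, and each line read from a file still ends with a newline. A minimal sketch of the same extraction as a helper that guards against both; the name codon_at and the sample sequences are hypothetical and only stand in for the lines of f2:

```python
def codon_at(nuc_seq, index=30, size=3):
    """Return the codon at a 0-based index, or None if the sequence is too short."""
    seq = nuc_seq.strip()  # drop the trailing newline left by file iteration
    # keep only complete codons; a trailing partial chunk is ignored
    codons = [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]
    return codons[index] if index < len(codons) else None

# Hypothetical sequences standing in for the lines of f2
lines = ["ACG" * 30 + "GTC" + "TTT\n", "ACGACG\n"]
codon_list = [codon_at(line) for line in lines]
print(codon_list)  # ['GTC', None]
```

Returning None for short sequences is one possible choice; skipping those lines or raising an error may suit your data better.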
Once you have this list, you could do something like what I've done below. The script below is self-contained, which is to say that it can be run with a simple copy and paste.
import pandas as pd
# generate example dataframe
df = pd.DataFrame({'position': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 56: 57, 57: 58, 58: 59, 59: 60, 60: 61},
                   'codon': {0: 'GCT', 1: 'GCC', 2: 'GCA', 3: 'GCG', 4: 'CGT', 56: 'TAC', 57: 'GTT', 58: 'GTC', 59: 'GTA', 60: 'GTG'},
                   'aminoacid': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'R', 56: 'Y', 57: 'V', 58: 'V', 59: 'V', 60: 'V'}})
# example list of codons
codon_list = ['TTT','GTC','CTC','GCC']
for c in codon_list:
    # boolean mask marking the rows whose codon equals c
    where = (df['codon'] == c)
    if where.any():
        # idxmax returns the index label of the first True in the mask
        pos = df.at[where.idxmax(), 'position']
        print(f"codon {c}: position {pos}")
    else:
        print(f"codon {c} not found")
Example dataframe that's generated:
    position codon aminoacid
0          1   GCT         A
1          2   GCC         A
2          3   GCA         A
3          4   GCG         A
4          5   CGT         R
56        57   TAC         Y
57        58   GTT         V
58        59   GTC         V
59        60   GTA         V
60        61   GTG         V
Resulting prints from script:
codon TTT not found
codon GTC: position 59
codon CTC not found
codon GCC: position 2
If speed is a concern, I suspect that replacing pos = df.at[where.idxmax(), 'position'] with pos = df.loc[where,'position'].reset_index(drop=True).iat[0] would speed things up.
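If many codons need to be looked up, another option (a different technique from the loop above, not a tuning of it) is to build a codon-to-position lookup once and align it to the list with reindex. This is a minimal sketch that keeps the first row for any repeated codon; the small df here is a hypothetical stand-in for the real data:

```python
import pandas as pd

# Hypothetical stand-in for the real codon table
df = pd.DataFrame({
    "position": [1, 2, 59],
    "codon": ["GCT", "GCC", "GTC"],
    "aminoacid": ["A", "A", "V"],
})
codon_list = ["TTT", "GTC", "CTC", "GCC"]

# Build the codon -> position lookup once; keep the first row for any repeated codon
lookup = df.drop_duplicates("codon").set_index("codon")["position"]

# reindex aligns the lookup to codon_list; codons missing from df become NaN
positions = lookup.reindex(codon_list)

for codon, pos in positions.items():
    if pd.isna(pos):
        print(f"codon {codon} not found")
    else:
        print(f"codon {codon}: position {int(pos)}")
```

Whether this is actually faster depends on the size of the table and the list, so it is worth timing both versions on your own data.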