I have a pandas dataframe that contains DNA sequences and gene names. I want to translate the DNA sequences into protein sequences, and store the protein sequences in a new column.
The data frame looks like:
DNA | gene_name |
---|---|
ATGGATAAG | gene_1 |
ATGCAGGAT | gene_2 |
After translating the DNA and storing the result, the dataframe would look like:
DNA | gene_name | protein |
---|---|---|
ATGGATAAG... | gene_1 | MDK... |
ATGCAGGAT... | gene_2 | MQD... |
I am aware of biopython's (https://biopython.org/wiki/Seq) ability to translate DNA to protein, for example:
>>> from Bio.Seq import Seq
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*')
However, I am not sure how to implement this in the context of a dataframe. Any help would be much appreciated!
CodePudding user response:
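I would suggest using pandas.Series.apply on the "DNA" column. Note that Series.apply works element by element, so no axis argument is needed (passing axis=1 here would be forwarded to the lambda and raise a TypeError). Wrapping the result in str() stores plain strings rather than Seq objects.
Something like:
df['protein'] = df['DNA'].apply(lambda x: str(Seq(x).translate()))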
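For completeness, here is a minimal runnable sketch of the apply approach, using the two example rows from the question. It assumes Biopython and pandas are installed; the helper name translate_dna is just something I made up for illustration, not part of either library.

```python
import pandas as pd
from Bio.Seq import Seq  # provided by the Biopython package

# Rebuild the example dataframe from the question.
df = pd.DataFrame({
    "DNA": ["ATGGATAAG", "ATGCAGGAT"],
    "gene_name": ["gene_1", "gene_2"],
})

def translate_dna(dna: str) -> str:
    """Hypothetical helper: translate a DNA string and return a plain str."""
    return str(Seq(dna).translate())

# Series.apply calls the helper once per element of the "DNA" column.
df["protein"] = df["DNA"].apply(translate_dna)
print(df)
```

Using a named function instead of a lambda makes it easier to add checks later, for example for sequences whose length is not a multiple of three.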
CodePudding user response:
Since you want to translate each sequence in the "DNA" column, you could use a list comprehension:
df['protein'] = [str(Seq(sq).translate()) for sq in df['DNA']]
Output:
DNA gene_name protein
0 ATGGATAAG gene_1 MDK
1 ATGCAGGAT gene_2 MQD
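If your sequences contain stop codons (like the '*' characters in the Biopython example from the question), you may want the protein to end at the first stop. A small sketch, assuming Biopython and pandas are installed: Seq.translate accepts to_stop=True, which stops at the first in-frame stop codon and leaves the '*' out. The dataframe below is made up for illustration, reusing sequences from the question.

```python
import pandas as pd
from Bio.Seq import Seq  # provided by the Biopython package

# Illustrative data: the first sequence has an in-frame stop codon (TGA).
df = pd.DataFrame({
    "DNA": ["ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", "ATGGATAAG"],
    "gene_name": ["gene_1", "gene_2"],
})

# to_stop=True ends translation at the first stop codon and omits the '*'.
df["protein"] = [str(Seq(sq).translate(to_stop=True)) for sq in df["DNA"]]
print(df[["gene_name", "protein"]])
```

Whether to trim at the first stop depends on your data; if the DNA column holds complete coding sequences, the default translate() with the trailing '*' may be what you want instead.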