Using R, I'd like to write a "reverse translation" function that takes as input a string of amino acids and for each one returns a triplet codon that encodes that amino acid. The codon would be randomly chosen from a frequency table that I will initially hard-code the values for, but I would also like to eventually be able to adjust the frequencies so that "rarer" codons will be more likely to be chosen.
So far I have tried to define a "codon" class using the 61 codons that encode an amino acid in the standard genetic code.
# Define codon class
codon <- function(sequence, aminoacid, frequency) {
new_codon <- list(sequence = sequence, aminoacid = aminoacid, frequency = frequency)
class(new_codon) <- "codon"
new_codon
}
# Define the 61 codon frequency table
# structure: nucleotide sequence, amino acid it encodes, frequency codon is used
codons <- list(
# Alanine (Ala) codons
codon("GCT", "A", 0.26),
codon("GCC", "A", 0.40),
codon("GCA", "A", 0.23),
codon("GCG", "A", 0.11),
# Cysteine (Cys) codons
codon("TGT", "C", 0.45),
codon("TGC", "C", 0.55),
# Aspartic acid (Asp) codons
codon("GAT", "D", 0.46),
codon("GAC", "D", 0.54),
# Glutamic acid (Glu) codons
codon("GAA", "E", 0.42),
codon("GAG", "E", 0.58),
# Phenylalanine (Phe) codons
codon("TTT", "F", 0.45),
codon("TTC", "F", 0.55),
# Glycine (Gly) codons
codon("GGT", "G", 0.16),
codon("GGC", "G", 0.34),
codon("GGA", "G", 0.25),
codon("GGG", "G", 0.25),
# Histidine (His) codons
codon("CAT", "H", 0.41),
codon("CAC", "H", 0.59),
# Isoleucine (Ile) codons
codon("ATT", "I", 0.36),
codon("ATC", "I", 0.48),
codon("ATA", "I", 0.16),
# Lysine (Lys) codons
codon("AAA", "K", 0.42),
codon("AAG", "K", 0.58),
# Leucine (Leu) codons
codon("TTA", "L", 0.07),
codon("TTG", "L", 0.13),
codon("CTT", "L", 0.13),
codon("CTC", "L", 0.20),
codon("CTA", "L", 0.07),
codon("CTG", "L", 0.41),
# Methionine (Met) codon
codon("ATG", "M", 1.0),
# Asparagine (Asn) codons
codon("AAT", "N", 0.46),
codon("AAC", "N", 0.54),
# Proline (Pro) codons
codon("CCT", "P", 0.28),
codon("CCC", "P", 0.33),
codon("CCA", "P", 0.27),
codon("CCG", "P", 0.11),
# Glutamine (Gln) codons
codon("CAA", "Q", 0.25),
codon("CAG", "Q", 0.75),
# Arginine (Arg) codons
codon("CGT", "R", 0.08),
codon("CGC", "R", 0.19),
codon("CGA", "R", 0.11),
codon("CGG", "R", 0.21),
codon("AGA", "R", 0.20),
codon("AGG", "R", 0.20),
# Serine (Ser) codons
codon("TCT", "S", 0.18),
codon("TCC", "S", 0.22),
codon("TCA", "S", 0.15),
codon("TCG", "S", 0.06),
codon("AGT", "S", 0.15),
codon("AGC", "S", 0.24),
# Threonine (Thr) codons
codon("ACT", "T", 0.24),
codon("ACC", "T", 0.36),
codon("ACA", "T", 0.28),
codon("ACG", "T", 0.12),
# Valine (Val) codons
codon("GTT", "V", 0.18),
codon("GTC", "V", 0.24),
codon("GTA", "V", 0.11),
codon("GTG", "V", 0.47),
# Tryptophan (Trp) codon
codon("TGG", "W", 1.0),
# Tyrosine (Tyr) codons
codon("TAT", "Y", 0.43),
codon("TAC", "Y", 0.57)
)
The function I tried below:
# Define function to generate new sequence
random_sequence <- function(protein_sequence) {
# Initialize empty list to store new sequence
new_sequence <- list()
# Iterate over each amino acid in the protein sequence
for (aa in protein_sequence) {
# Get all codons that encode the current amino acid
aa_codons <- codons[sapply(codons, function(x) x$aminoacid == aa)]
# Get the frequencies of the codons
aa_freqs <- sapply(aa_codons, function(x) x$frequency)
# Normalize the frequencies to sum to 1
aa_freqs <- aa_freqs / sum(aa_freqs)
# Select a random codon based on the frequencies
new_codon <- aa_codons[sample(1:length(aa_codons), 1, prob = aa_freqs)]
# Add the new codon to the new sequence
new_sequence <- c(new_sequence, new_codon[[1]][1]$sequence)
}
# Return the new sequence
new_sequence <- paste(new_sequence, collapse = "")
return(new_sequence)
}
# Example usage
protein_sequence <- "MAVVDLKECEILHTWVFPKKKHGEARNDDCQQGKPST"
random_sequence(protein_sequence)
When I run each line of this function individually I get the results I expect. Also, if I set the protein_sequence to be one letter long, it also works. However, when I try to run the function as written, I get the error message:
Error in sum(aa_freqs) : invalid 'type' (list) of argument
I tried to fix this by removing the normalize frequencies to sum to 1 safety check (# aa_freqs <- aa_freqs / sum(aa_freqs)
. But now I get a different error:
Error in sample.int(length(x), size, replace, prob) : incorrect number of probabilities
I am not sure how to troubleshoot the bug given that when I run through the function line by line with a one letter test amino acid, it all seems to work fine.
Any assistance or insight would be greatly appreciated! Thank you very much for your help!
CodePudding user response:
It looks like your function is trying to parse the whole amino acid sequence at once, rather than letter-by-letter. You can use strsplit
or equivalent to split the passed string into component parts and then more or less do as you have indicated:
ReverseTranscribe <- function(protein_sequence) {
paste(sapply(unlist(strsplit(protein_sequence, "")), function(cur_amino) {
# Extract possible codons
potential_matches <- codons[sapply(codons, function(x) {x$aminoacid == cur_amino})]
# Convert to a weights table
weights_table <- data.frame(matrix(sapply(potential_matches, unlist), ncol = 3, byrow = TRUE))
# Weighted sample
chosen_codon <- sample(weights_table[[1]], 1, prob = weights_table[[3]])
return(chosen_codon)
}), collapse = "-")
}
example <- "MAVVDLKECEILHTWVFPKKKHGEARNDDCQQGKPST"
> for (i in 1:3) {
print(ReverseTranscribe(example))
}
[1] "ATG-GCT-GTC-GTG-GAT-CTC-AAG-GAA-TGT-GAG-ATC-CTG-CAT-ACC-TGG-GTT-TTC-CCC-AAG-AAA-AAG-CAC-GGT-GAG-GCT-CGG-AAT-GAC-GAT-TGC-CAG-CAG-GGC-AAG-CCT-AGC-ACC"
[1] "ATG-GCC-GTG-GTG-GAT-CTG-AAG-GAG-TGC-GAG-ATC-CTG-CAC-ACC-TGG-GTC-TTT-CCT-AAG-AAG-AAA-CAT-GGC-GAG-GCC-CGT-AAT-GAC-GAC-TGC-CAA-CAA-GGG-AAG-CCA-TCA-ACA"
[1] "ATG-GCC-GTG-GTG-GAT-TTG-AAA-GAG-TGC-GAG-ATC-CTG-CAC-ACC-TGG-GTT-TTC-CCC-AAG-AAG-AAG-CAC-GGA-GAG-GCT-CGC-AAC-GAT-GAT-TGT-CAG-CAG-GGC-AAG-CCC-TCT-ACC"