I am trying to look at the codon usage within the transmembrane domains of certain proteins.
To do this, I have the sequences for the TM domain, and I want to search these sequences for how often certain codons appear (the frequency).
Ideally I would like to add new columns to an existing dataframe with the counts for each codon per gene. Like this hypothetical data:
Gene ID | TM_domain_Seq | AAA | CAC | GGA |
---|---|---|---|---|
ENSG00000003989 | TGGAGCCTCGCTC | 0 | 0 | 1 |
ENSG00000003989 | TGGAGCCTCGCTC | 0 | 0 | 1 |
ENSG00000003989 | TGGAGCCTCGCTC | 0 | 0 | 1 |
ENSG00000003989 | TGGAGCCTCGCTC | 0 | 0 | 1 |
ENSG00000003989 | TGGAGCCTCGCTC | 0 | 0 | 1 |
I have tried the following - creating a function to count how often a particular codon comes up, and applying it to each TM sequence. The problem I am having is how to get a new column added to my data frame for each codon, and how to get the codon frequencies into them.
I have tried for loops, but they take way too long
amino_search <- function(seq) {
count <- str_count(seq, pattern = codons)
return(count)
}
codon_search <- function(TMseq) {
High_cor$Newcol <- unlist(lapply(TMseq, amino_search))
}
Any help would be greatly appreciated. Thank you!
CodePudding user response:
Create the vector of possible combinations, then use str_count
:
comb <- expand.grid(replicate(3, c("A", "T", "G", "C"), simplify = FALSE)) |>
apply(MARGIN = 1, FUN = paste, collapse = "")
#apply(X = _, 1, FUN = paste, collapse = "") #with the new placeholder
df[, comb] <- t(sapply(df$TM_domain_Seq, stringr::str_count, comb))
If you want only in-frame codons, one way to do that is to add a space every three characters:
gsub('(.{3})', '\\1 ', df$TM_domain_Seq[1])
#[1] "TGG AGC CTC GCT C"
df[, comb] <- t(sapply(gsub('(.{3})', '\\1 ', df$TM_domain_Seq), stringr::str_count, comb))
output
# A tibble: 5 × 66
Gene_ID TM_domain_Seq AAA CAC GGA TAA GAA CAA ATA TTA GTA CTA AGA TGA CGA ACA TCA
<chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 ENSG00… TGGAGCCTCGCTC 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
2 ENSG00… TGGAGCCTCGCTC 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
3 ENSG00… TGGAGCCTCGCTC 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
4 ENSG00… TGGAGCCTCGCTC 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
5 ENSG00… TGGAGCCTCGCTC 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
# … with 49 more variables: GCA <int>, CCA <int>, AAT <int>, TAT <int>, GAT <int>, CAT <int>, ATT <int>,
# TTT <int>, GTT <int>, CTT <int>, AGT <int>, TGT <int>, GGT <int>, CGT <int>, ACT <int>, TCT <int>,
# GCT <int>, CCT <int>, AAG <int>, TAG <int>, GAG <int>, CAG <int>, ATG <int>, TTG <int>, GTG <int>,
# CTG <int>, AGG <int>, TGG <int>, GGG <int>, CGG <int>, ACG <int>, TCG <int>, GCG <int>, CCG <int>,
# AAC <int>, TAC <int>, GAC <int>, ATC <int>, TTC <int>, GTC <int>, CTC <int>, AGC <int>, TGC <int>,
# GGC <int>, CGC <int>, ACC <int>, TCC <int>, GCC <int>, CCC <int>
CodePudding user response:
Here's a solution using nested for loop.
The first loop goes over all rows in df
, and the second loop goes over every possible non-overlapping codon.
for (i in 1:nrow(df)) {
for (j in seq(1, nchar(df[i, 2]) - 3, 3)) {
df[i, colnames(df) == substr(df[i, 2], j, j 2)] <- df[i, substr(df[i, 2], j, j 2)] 1
}
}
Update: I've found a solution with the seqinr
package.
library(seqinr)
library(stringr)
codon_list <- do.call(paste0, expand.grid(rep(list(c('A', 'G', 'T', 'C')), 3)))
df[, codon_list] <- t(sapply(1:nrow(df), function(x) uco(c(str_split(df[x, 2], "", simplify = T)), index = "eff")))
CodePudding user response:
Split the problem into sub-problems, solve them individually, and compose the solution.
The first subproblem is: how do I get codon frequencies of a given (in-frame) sequence? The answer is either to use a pre-made solution (e.g. Bioconductor’s Biostrings:: trinucleotideFrequency(…, steps = 3L)
), or something quick and dirty like the following:
codon_frequencies = function (seq) {
# Take care of incomplete codon at end.
len = nchar(seq) - (nchar(seq) %% 3L)
start = seq(1L, len, by = 3L)
substring(seq, start, start 2L) |> table()
}
Try it:
codon_frequencies('TGGAGCCTCGCTC')
#
# AGC CTC GCT TGG
# 1 1 1 1
… incidentally, is it intentional that your sequences have fragmentary codons? If so, are you sure they always start on a full codon?
OK. The next step is calling this function for each gene ID in your table, and collecting the results. At this point, we’re helped by the fact that a counts table can be converted to a tidy data frame:
data.frame(codon_frequencies('TGGAGCCTCGCTC'))
# Var1 Freq
# 1 AGC 1
# 2 CTC 1
# 3 GCT 1
# 4 TGG 1
For our purposes, this is a convenient format, because it makes table manipulation easier (especially when working in tidy data format, which I’m doing in the following using the packages ‘dplyr’, ‘tidyr’ and ‘purrr’):
df |>
group_by(`Gene ID`) |>
summarize(map_dfr(TM_domain_Seq, ~ data.frame(codon_frequencies(.x))))
# # A tibble: 20 × 3
# # Groups: Gene ID [1]
# `Gene ID` Var1 Freq
# <chr> <fct> <int>
# 1 ENSG00000003989 AGC 1
# 2 ENSG00000003989 CTC 1
# 3 ENSG00000003989 GCT 1
# 4 ENSG00000003989 TGG 1
# …
At this point we could probably call it a day: this is a convenient format to work with. However, if you prefer, you can also pivot the data into wide format:
… |>
pivot_wider(
id_cols = `Gene ID`,
names_from = Var1,
values_from = Freq,
values_fill = 0L # Otherwise missing codons will be `NA`
)
# # A tibble: 5 × 11
# # Groups: Gene ID [5]
# `Gene ID` AGC CTC GCA TGG TGA GCG AGT GAT TAC GCT
# <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 ENSG00000003981 1 1 1 1 0 0 0 0 0 0
# 2 ENSG00000003982 1 1 0 1 1 0 0 0 0 0
# 3 ENSG00000003983 1 1 0 1 0 1 0 0 0 0
# 4 ENSG00000003984 0 0 0 0 0 0 1 1 1 0
# 5 ENSG00000003989 1 1 0 1 0 0 0 0 0 1
(This is using some different toy data.)
Finally, if you want to have columns for all codons, even those not present in your data, you can make a small modification to the codon_frequencies
function:
all_codons = c('A', 'C', 'G', 'T') %>% expand.grid(., ., .) |> apply(1L, paste, collapse = '')
codon_frequencies = function (seq, all = FALSE) {
# Take care of incomplete codon at end.
len = nchar(seq) - (nchar(seq) %% 3L)
start = seq(1L, len, by = 3L)
codons = substring(seq, start, start 2L)
table(if (all) factor(codons, levels = all_codons) else codons)
}
And then call it as codon_frequencies(.x, all = TRUE)
in the code above. The pivot_wider
no longer needs the values_fill = 0L
argument then.
Putting it all together:
df |>
group_by(`Gene ID`) |>
summarize(
map_dfr(TM_domain_Seq, ~ data.frame(codon_frequencies(.x, all = TRUE))),
.groups = 'drop'
) |>
pivot_wider(
id_cols = `Gene ID`,
names_from = Var1,
values_from = Freq
)