Quantifying frequency of codons in a transmembrane sequence

I am trying to look at the codon usage within the transmembrane domains of certain proteins.

To do this, I have the sequences for the TM domain, and I want to search these sequences for how often certain codons appear (the frequency).

Ideally I would like to add new columns to an existing dataframe with the counts for each codon per gene. Like this hypothetical data:

Gene ID	TM_domain_Seq	GGA
ENSG00000003989	TGGAGCCTCGCTC	1
ENSG00000003989	TGGAGCCTCGCTC	1
ENSG00000003989	TGGAGCCTCGCTC	1
ENSG00000003989	TGGAGCCTCGCTC	1
ENSG00000003989	TGGAGCCTCGCTC	1

I have tried the following - creating a function to count how often a particular codon comes up, and applying it to each TM sequence. The problem I am having is how to get a new column added to my data frame for each codon, and how to get the codon frequencies into them.

I have tried for loops, but they take way too long

amino_search <- function(seq) {
  
  count <- str_count(seq, pattern = codons)
  return(count)
}

codon_search <- function(TMseq) {
  
 High_cor$Newcol <- unlist(lapply(TMseq, amino_search))
}

Any help would be greatly appreciated. Thank you!

CodePudding user response：

Create the vector of possible combinations, then use str_count:

comb <- expand.grid(replicate(3, c("A", "T", "G", "C"), simplify = FALSE)) |>
  apply(MARGIN = 1, FUN = paste, collapse = "")
  #apply(X = _, 1, FUN = paste, collapse = "") #with the new placeholder

df[, comb] <- t(sapply(df$TM_domain_Seq, stringr::str_count, comb))

If you want only in-frame codons, one way to do that is to add a space every three characters:

gsub('(.{3})', '\\1 ', df$TM_domain_Seq[1])
#[1] "TGG AGC CTC GCT C"

df[, comb] <- t(sapply(gsub('(.{3})', '\\1 ', df$TM_domain_Seq), stringr::str_count, comb))

output

# A tibble: 5 × 66
  Gene_ID TM_domain_Seq   AAA   CAC   GGA   TAA   GAA   CAA   ATA   TTA   GTA   CTA   AGA   TGA   CGA   ACA   TCA
  <chr>   <chr>         <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 ENSG00… TGGAGCCTCGCTC     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0
2 ENSG00… TGGAGCCTCGCTC     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0
3 ENSG00… TGGAGCCTCGCTC     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0
4 ENSG00… TGGAGCCTCGCTC     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0
5 ENSG00… TGGAGCCTCGCTC     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0
# … with 49 more variables: GCA <int>, CCA <int>, AAT <int>, TAT <int>, GAT <int>, CAT <int>, ATT <int>,
#   TTT <int>, GTT <int>, CTT <int>, AGT <int>, TGT <int>, GGT <int>, CGT <int>, ACT <int>, TCT <int>,
#   GCT <int>, CCT <int>, AAG <int>, TAG <int>, GAG <int>, CAG <int>, ATG <int>, TTG <int>, GTG <int>,
#   CTG <int>, AGG <int>, TGG <int>, GGG <int>, CGG <int>, ACG <int>, TCG <int>, GCG <int>, CCG <int>,
#   AAC <int>, TAC <int>, GAC <int>, ATC <int>, TTC <int>, GTC <int>, CTC <int>, AGC <int>, TGC <int>,
#   GGC <int>, CGC <int>, ACC <int>, TCC <int>, GCC <int>, CCC <int>

CodePudding user response：

Here's a solution using nested for loop.

The first loop goes over all rows in df, and the second loop goes over every possible non-overlapping codon.

for (i in 1:nrow(df)) {
  for (j in seq(1, nchar(df[i, 2]) - 3, 3)) {
    df[i, colnames(df) == substr(df[i, 2], j, j   2)] <- df[i, substr(df[i, 2], j, j   2)]   1
  }
}

Update: I've found a solution with the seqinr package.

library(seqinr)
library(stringr)

codon_list <- do.call(paste0, expand.grid(rep(list(c('A', 'G', 'T', 'C')), 3)))

df[, codon_list] <- t(sapply(1:nrow(df), function(x) uco(c(str_split(df[x, 2], "", simplify = T)), index = "eff")))

CodePudding user response：

Split the problem into sub-problems, solve them individually, and compose the solution.

The first subproblem is: how do I get codon frequencies of a given (in-frame) sequence? The answer is either to use a pre-made solution (e.g. Bioconductor’s Biostrings:: trinucleotideFrequency(…, steps = 3L)), or something quick and dirty like the following:

codon_frequencies = function (seq) {
    # Take care of incomplete codon at end.
    len = nchar(seq) - (nchar(seq) %% 3L)
    start = seq(1L, len, by = 3L)
    substring(seq, start, start   2L) |> table()
}

Try it:

codon_frequencies('TGGAGCCTCGCTC')
#
# AGC CTC GCT TGG
#   1   1   1   1

… incidentally, is it intentional that your sequences have fragmentary codons? If so, are you sure they always start on a full codon?

OK. The next step is calling this function for each gene ID in your table, and collecting the results. At this point, we’re helped by the fact that a counts table can be converted to a tidy data frame:

data.frame(codon_frequencies('TGGAGCCTCGCTC'))
#   Var1 Freq
# 1  AGC    1
# 2  CTC    1
# 3  GCT    1
# 4  TGG    1

For our purposes, this is a convenient format, because it makes table manipulation easier (especially when working in tidy data format, which I’m doing in the following using the packages ‘dplyr’, ‘tidyr’ and ‘purrr’):

df |>
    group_by(`Gene ID`) |>
    summarize(map_dfr(TM_domain_Seq, ~ data.frame(codon_frequencies(.x))))
# # A tibble: 20 × 3
# # Groups:   Gene ID [1]
#    `Gene ID`       Var1   Freq
#    <chr>           <fct> <int>
#  1 ENSG00000003989 AGC       1
#  2 ENSG00000003989 CTC       1
#  3 ENSG00000003989 GCT       1
#  4 ENSG00000003989 TGG       1
# …

At this point we could probably call it a day: this is a convenient format to work with. However, if you prefer, you can also pivot the data into wide format:

    … |>
    pivot_wider(
        id_cols = `Gene ID`,
        names_from = Var1,
        values_from = Freq,
        values_fill = 0L # Otherwise missing codons will be `NA`
    )
# # A tibble: 5 × 11
# # Groups:   Gene ID [5]
#   `Gene ID`         AGC   CTC   GCA   TGG   TGA   GCG   AGT   GAT   TAC   GCT
#   <chr>           <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 ENSG00000003981     1     1     1     1     0     0     0     0     0     0
# 2 ENSG00000003982     1     1     0     1     1     0     0     0     0     0
# 3 ENSG00000003983     1     1     0     1     0     1     0     0     0     0
# 4 ENSG00000003984     0     0     0     0     0     0     1     1     1     0
# 5 ENSG00000003989     1     1     0     1     0     0     0     0     0     1

(This is using some different toy data.)

Finally, if you want to have columns for all codons, even those not present in your data, you can make a small modification to the codon_frequencies function:

all_codons = c('A', 'C', 'G', 'T') %>% expand.grid(., ., .) |> apply(1L, paste, collapse = '')

codon_frequencies = function (seq, all = FALSE) {
    # Take care of incomplete codon at end.
    len = nchar(seq) - (nchar(seq) %% 3L)
    start = seq(1L, len, by = 3L)
    codons = substring(seq, start, start   2L)
    table(if (all) factor(codons, levels = all_codons) else codons)
}

And then call it as codon_frequencies(.x, all = TRUE) in the code above. The pivot_wider no longer needs the values_fill = 0L argument then.

Putting it all together:

df |>
    group_by(`Gene ID`) |>
    summarize(
        map_dfr(TM_domain_Seq, ~ data.frame(codon_frequencies(.x, all = TRUE))),
        .groups = 'drop'
    ) |>
    pivot_wider(
        id_cols = `Gene ID`,
        names_from = Var1,
        values_from = Freq
    )