How to maximize correlation by sampling using R-CodePudding

I have the following reference sequence:

reference_seq <- "KPAACQHRQDKWKNSHWNRFKAYFVVIKKK"

With this function that simply calculate the composition of amino acid sequence:

calculate_aa_content <- function (x) 
{

  AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", 
              "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
  AAC <- summary(factor(strsplit(x, split = "")[[1]], levels = AADict), 
                 maxsum = 21)/nchar(x)
  AAC
}

I can get the AA composition of the reference sequence:

> calculate_aa_content(reference_seq)
         A          R          N          D          C          E 
0.10000000 0.06666667 0.06666667 0.03333333 0.03333333 0.00000000 
         Q          G          H          I          L          K 
0.06666667 0.00000000 0.06666667 0.03333333 0.00000000 0.23333333 
         M          F          P          S          T          W 
0.00000000 0.06666667 0.03333333 0.03333333 0.00000000 0.06666667 
         Y          V 
0.03333333 0.06666667

Then I have the seed sequence:

seed          <- "FKDHKHIDVKDRHRTRHLAK??????????"

What I want to do is to sample 10aa represented by 10 question marks (?) from 20 amino acid vector here:

 AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", 
                  "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

So that the Pearson correlation of final constructed sequence from seed with reference sequence is maximized (allowing certain number of iterations if not possible to get absolute max).

For example one of the sample from seed sampling, could be:

> sample_1 <- "FKDHKHIDVKDRHRTRHLAKHHHHHHHHHH"
> cor(calculate_aa_content(reference_seq), calculate_aa_content(sample_1))
[1] 0.2955238

But the correlation is low. What I'd like to do is to find max correlation of constructed string from seed with reference sequence after certain number of iteration.

An additional feature with early stopping if the difference with current maximum is within certain threshold, say 0.01, would be appreciated.

Note that the final sequence should not be the same with reference squence.

How can achieve that efficiently with R?

CodePudding user response：

Write a function to simulate new sequences and compute the correlation. Then call the function R times, get the cor components and find the maximum.

calculate_aa_content <- function (x) {
  AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", 
              "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
  AAC <- summary(factor(strsplit(x, split = "")[[1]], levels = AADict), 
                 maxsum = 21)/nchar(x)
  AAC
}

fun <- function(x, ref, dict = AADict) {
  i <- regexpr("\\?", x)
  n <- nchar(x) - i   1L
  new <- sample(dict, n, TRUE)
  new <- paste(new, collapse = "")
  substring(x, i) <- new
  list(
    new = x,
    cor = cor(calculate_aa_content(ref), calculate_aa_content(x))
  )
}

reference_seq <- "KPAACQHRQDKWKNSHWNRFKAYFVVIKKK"
seed <- "FKDHKHIDVKDRHRTRHLAK??????????"

AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", 
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

set.seed(2022)
R <- 1e4
res <- replicate(R, fun(seed, reference_seq), simplify = FALSE)
cor_vec <- sapply(res, `[[`, 'cor')
res[which.max(cor_vec)]
#> [[1]]
#> [[1]]$new
#> [1] "FKDHKHIDVKDRHRTRHLAKKKWYKEKWFN"
#> 
#> [[1]]$cor
#> [1] 0.7902037

^{Created on 2022-10-21 with reprex v2.0.2}

In the event that there are more than one value equal to the maximum, use instead

res[cor_vec == max(cor_vec)]