I have the following reference sequence:
reference_seq <- "KPAACQHRQDKWKNSHWNRFKAYFVVIKKK"
With this function that simply calculate the composition of amino acid sequence:
calculate_aa_content <- function (x)
{
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
"I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
AAC <- summary(factor(strsplit(x, split = "")[[1]], levels = AADict),
maxsum = 21)/nchar(x)
AAC
}
I can get the AA composition of the reference sequence:
> calculate_aa_content(reference_seq)
A R N D C E
0.10000000 0.06666667 0.06666667 0.03333333 0.03333333 0.00000000
Q G H I L K
0.06666667 0.00000000 0.06666667 0.03333333 0.00000000 0.23333333
M F P S T W
0.00000000 0.06666667 0.03333333 0.03333333 0.00000000 0.06666667
Y V
0.03333333 0.06666667
Then I have the seed sequence:
seed <- "FKDHKHIDVKDRHRTRHLAK??????????"
What I want to do is to sample 10aa represented by 10 question marks (?) from 20 amino acid vector here:
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
"I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
So that the Pearson correlation of final constructed sequence from seed with reference sequence is maximized (allowing certain number of iterations if not possible to get absolute max).
For example one of the sample from seed sampling, could be:
> sample_1 <- "FKDHKHIDVKDRHRTRHLAKHHHHHHHHHH"
> cor(calculate_aa_content(reference_seq), calculate_aa_content(sample_1))
[1] 0.2955238
But the correlation is low. What I'd like to do is to find max correlation of constructed string from seed with reference sequence after certain number of iteration.
An additional feature with early stopping if the difference with current maximum is within certain threshold, say 0.01, would be appreciated.
Note that the final sequence should not be the same with reference squence.
How can achieve that efficiently with R?
CodePudding user response:
Write a function to simulate new sequences and compute the correlation. Then call the function R
times, get the cor
components and find the maximum.
calculate_aa_content <- function (x) {
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
"I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
AAC <- summary(factor(strsplit(x, split = "")[[1]], levels = AADict),
maxsum = 21)/nchar(x)
AAC
}
fun <- function(x, ref, dict = AADict) {
i <- regexpr("\\?", x)
n <- nchar(x) - i 1L
new <- sample(dict, n, TRUE)
new <- paste(new, collapse = "")
substring(x, i) <- new
list(
new = x,
cor = cor(calculate_aa_content(ref), calculate_aa_content(x))
)
}
reference_seq <- "KPAACQHRQDKWKNSHWNRFKAYFVVIKKK"
seed <- "FKDHKHIDVKDRHRTRHLAK??????????"
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
"I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
set.seed(2022)
R <- 1e4
res <- replicate(R, fun(seed, reference_seq), simplify = FALSE)
cor_vec <- sapply(res, `[[`, 'cor')
res[which.max(cor_vec)]
#> [[1]]
#> [[1]]$new
#> [1] "FKDHKHIDVKDRHRTRHLAKKKWYKEKWFN"
#>
#> [[1]]$cor
#> [1] 0.7902037
Created on 2022-10-21 with reprex v2.0.2
In the event that there are more than one value equal to the maximum, use instead
res[cor_vec == max(cor_vec)]