I want to build a model in which I duplicate a sentence several times, introducing random errors each time. The duplicates themselves also get duplicated. So in cycle one I start with just "example_sentence", in cycle two I have two copies of that sentence, and in cycle three I have four. I want to do this for 25 cycles with 20k sentences. The code I wrote for this runs far too slowly. Is there a way to make my nested for loops more efficient? Here is the slowest part of the code:
alphabet <- c("a","b","d","j")
modr1 <- "sentencetoduplicate"
# Pool of outcomes: one 1 (error) and 999 zeros, i.e. a 1/1000 error rate
errorRate <- c()
errorRate <- append(errorRate, rep(1,1))
errorRate <- append(errorRate, rep(0,999))
duplicate <- c(modr1)
for (q in 1:25) {
  collect <- c()
  for (z in seq_along(duplicate)) {
    modr1 <- duplicate[z]
    compile1 <- c()
    # Copy the sentence one character at a time, occasionally swapping in a random letter
    for (k in 1:nchar(modr1)) {
      error <- sample(errorRate, 1, replace = TRUE)
      if (error == 1) {
        compile1 <- append(compile1, sub(substring(modr1,k,k),sample(alphabet,1,replace=TRUE),substring(modr1,k,k)))
      } else {
        compile1 <- append(compile1, substring(modr1,k,k))
      }
    }
    modr1 <- paste(compile1, collapse='')
    collect <- append(collect, modr1)
  }
  # Keep the existing copies and add the new, possibly mutated, ones
  duplicate <- append(duplicate, collect)
}
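For a sense of scale, the doubling can be tallied directly. This is only a back-of-the-envelope sketch: the names `n_cycles`, `n_sentences`, `copies_after`, and `final_cycle_steps` are introduced here for illustration, and it assumes all 20k sentences go through the same 25 cycles.

```r
# Back-of-the-envelope count of how many strings the loop above produces.
# The numbers come from the description; the variable names are illustrative.
n_cycles    <- 25
n_sentences <- 20000

# The vector of copies doubles on every pass of the outer loop.
copies_after <- 2^(0:n_cycles)

copies_per_sentence <- copies_after[n_cycles + 1]   # 2^25 = 33554432
total_copies <- copies_per_sentence * n_sentences

# Character-level iterations in the final cycle alone, for one sentence:
# that cycle walks over 2^24 existing copies of a 19-character string.
final_cycle_steps <- 2^(n_cycles - 1) * nchar("sentencetoduplicate")

print(format(c(copies_per_sentence, total_copies, final_cycle_steps),
             big.mark = ","))
```

On top of the iteration count, every `append()` call copies the vector it grows, which adds to the cost of the innermost loop.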
CodePudding user response:
Here is a faster approach to your loop, but I think the difficulty of scaling it to 20K sentences remains!
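library(data.table)

# Mutate every sentence in a list column in one vectorized pass.
# The default alphabet matches the one in the question.
f <- function(let, alphabet = c("a","b","d","j"), error_rate = 1/1000) {
  lenlet <- length(let)                            # number of sentences
  let <- unlist(let)                               # all characters in one vector
  k <- rbinom(length(let), 1, prob = error_rate)   # 1 marks a character to replace
  let[k == 1] <- sample(alphabet, size = sum(k == 1), replace = TRUE)
  # One column per sentence (all sentences have the same length), returned as a list
  return(as.list(as.data.frame(matrix(let, ncol = lenlet))))
}

modr1 <- "sentencetoduplicate"
k <- data.table(list(strsplit(modr1, "")[[1]]))    # one row, list column V1
for (q in 1:25) {
  k[, V1 := list(f(V1))]       # introduce errors in all rows at once
  k <- k[rep(1:nrow(k), 2)]    # duplicate every row
}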
Updated with a slightly faster version! (Notice that this no longer uses by=1:nrow(k).)
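If you would rather avoid the data.table dependency, the same idea (one random draw per character, one vectorized replacement) can be sketched in base R with a character matrix. This is a rough sketch and not the code from the answer above: `mutate_copies()` and `run_cycles()` are names made up for illustration, and it assumes every copy has the same length and that an "error" means drawing a replacement letter from the alphabet, as in the question.

```r
# Sketch: keep all copies of one sentence in a character matrix
# (one row per copy, one column per position) and mutate whole matrices at once.
# mutate_copies() and run_cycles() are illustrative names, not from the thread.

mutate_copies <- function(mat, alphabet = c("a", "b", "d", "j"),
                          error_rate = 1 / 1000) {
  # One Bernoulli draw per character: TRUE marks a position that gets an error.
  hit <- runif(length(mat)) < error_rate
  # Replace the marked positions with random letters in a single assignment.
  mat[hit] <- sample(alphabet, size = sum(hit), replace = TRUE)
  mat
}

run_cycles <- function(sentence, n_cycles, ...) {
  # Start with a single copy: a 1 x nchar(sentence) matrix of characters.
  mat <- matrix(strsplit(sentence, "")[[1]], nrow = 1)
  for (q in seq_len(n_cycles)) {
    # Keep the existing copies and append a mutated duplicate of each,
    # mirroring duplicate <- append(duplicate, collect) in the question.
    mat <- rbind(mat, mutate_copies(mat, ...))
  }
  # Collapse each row back into a single string.
  do.call(paste0, as.data.frame(mat, stringsAsFactors = FALSE))
}

set.seed(1)
out <- run_cycles("sentencetoduplicate", n_cycles = 10)
length(out)          # 1024 copies after 10 doublings
table(nchar(out))    # every copy keeps the original length
```

At 25 cycles the matrix has 2^25 rows, so memory rather than loop overhead is likely to become the limit. On a 64-bit build each cell of a character matrix is an 8-byte pointer, which works out to roughly 5 GB for a single 19-character sentence.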