I have a problem doing sequence alignment/matching in R for lists. To explain: my data are clickstream data, and I have sequences divided into n-grams. The sequences look something like
1. ABDCGHEI... NaNa
2. ACSNa.... NaNa
and so on, where Na stands for "Not available" and is needed to pad the sequences to equal length. I put all of these sequences in a list in a crude way, like
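library(RWeka)  # provides NGramTokenizer() and Weka_control()

dativec <- as.vector(dataseq2)
# Pre-allocate the list, then store one sequence per element
prova2 <- vector("list", length(dativec))
for (i in seq_along(dativec)) {
  prova2[[i]] <- dativec[i]
}
# Tokenizer that returns only bigrams (min = max = 2)
BigramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
prova3 <- lapply(prova2, BigramTokenizer)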
and divided them into n-grams; e.g. the bigrams look like this:
[[1]] "A B" "B D" "D C".... "Na Na"
[[2]] "A C" "C S" .... "Na Na"
Now the challenge is: how can I match every bigram of each element of my list with every bigram of the other elements in the list?
I tried to use the Biostrings package, but the function pairwiseAlignment only gives back a score for the first bigram of each element in the list, whereas I just need to know whether the bigrams are identical, and I need all comparisons, not just the first elements. The desired result is the percentage of equal sub-n-grams, without information about positions; I only care about identity. I also tried the setdiff function, but apparently it doesn't work the way I want.
Edited for more clarity
CodePudding user response:
You can use outer to compare every bigram of one element with every bigram of another:
bigrams <- list(a = c("A B", "B D", "D C", "Na Na"),
                b = c("A C", "C S", "Na Na"))
with(bigrams, outer(a, b, `==`))
##>       [,1]  [,2]  [,3]
##> [1,] FALSE FALSE FALSE
##> [2,] FALSE FALSE FALSE
##> [3,] FALSE FALSE FALSE
##> [4,] FALSE FALSE  TRUE
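The outer() call above compares one pair of elements. To get the percentage of shared bigrams for every pair in the list, which is what the question asks for, something like the following might work. It is only a sketch: the helper name shared_fraction, the third list element c, and the decision to drop the "Na Na" padding bigrams before counting are all assumptions, not part of the original post.

```r
# Hypothetical helper: fraction of the distinct bigrams in x that also occur in y.
# Assumption: "Na Na" is pure padding and should not count as a match.
shared_fraction <- function(x, y, padding = "Na Na") {
  x <- unique(x[x != padding])
  y <- unique(y[y != padding])
  if (length(x) == 0) return(NA_real_)  # nothing left to compare
  mean(x %in% y)                        # position-independent identity check
}

# Example data; element c is made up for illustration
bigrams <- list(a = c("A B", "B D", "D C", "Na Na"),
                b = c("A C", "C S", "Na Na"),
                c = c("A B", "C S", "D C", "Na Na"))

# Compare every list element with every other one
n <- length(bigrams)
overlap <- matrix(NA_real_, n, n,
                  dimnames = list(names(bigrams), names(bigrams)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    overlap[i, j] <- shared_fraction(bigrams[[i]], bigrams[[j]])
  }
}

# Rows: the element whose bigrams are looked up; columns: where they are looked up
round(100 * overlap, 1)
```

The result is not symmetric, because each row is normalised by the number of distinct bigrams in that row's element: two of the three bigrams in a occur in c (about 66.7%), while only one of the three bigrams in c occurs in b (about 33.3%). If a symmetric measure is preferred, dividing the size of the intersection by the size of the union would be one alternative.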