I have these 10 numeric vectors. For simplicity, each contains 5 elements:
a <- c(1,2,3,4,5)
b <- c(1,2,3,4,6)
c <- c(1,2,3,4,6)
d <- c(1,2,3,4,6)
e <- c(6,2,9,7,3)
f <- c(7,3,5,7,6)
g <- c(7,9,3,4,0)
h <- c(4,6,4,6,9)
i <- c(8,8,5,3,8)
j <- c(2,1,1,2,3)
I want to find the 3 most related/similar vectors. Here that should be vectors b, c, and d.
Additionally, I am also hoping to get the other vector combinations besides the "most related" one (b, c, d). In this case these could be (a, b, c), (a, b, d), and (a, c, d).
The level of relation/similarity should itself have a score, so I can find the most similar, second most similar, etc.
Expected output is something like this, more or less:
similarity_rank  vectors  similarity_score (example)
1                b, c, d  0.99
2                a, b, c  0.8
etc.
What I have tried so far: pairwise correlation. It finds the relation between vectors, but only for 2 vectors at a time. I want a "similarity score" for a group of 3 vectors (or, in general, n vectors).
Rules:
- n: Number of desired vectors
- N: Number of all vectors
- N > n
- All vectors are numeric
Question: What is the best method to do that? (R code would be amazing, an R package would be great, or even just the method name is enough so I can learn about it.)
CodePudding user response:
Put the vectors as columns in a matrix, then calculate the cosine similarity of each pair of columns using crossprod, as in this answer. Then you can find the n largest values in each column.
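# Collect the vectors a..j as the columns of a 5 x 10 matrix
v <- mapply(get, letters[1:10], mode = "numeric")
# Cosine similarity of every column pair; the diagonal (self-similarity) is zeroed out
crossprod(v)/(sqrt(tcrossprod(colSums(v^2))))*(1 - diag(ncol(v)))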
#> a b c d e f g h i j
#> a 0.0000000 0.9958592 0.9958592 0.9958592 0.8062730 0.8946692 0.5415304 0.9616223 0.8162174 0.9280323
#> b 0.9958592 0.0000000 1.0000000 1.0000000 0.7636241 0.8736978 0.4943473 0.9592858 0.8106045 0.9318911
#> c 0.9958592 1.0000000 0.0000000 1.0000000 0.7636241 0.8736978 0.4943473 0.9592858 0.8106045 0.9318911
#> d 0.9958592 1.0000000 1.0000000 0.0000000 0.7636241 0.8736978 0.4943473 0.9592858 0.8106045 0.9318911
#> e 0.8062730 0.7636241 0.7636241 0.7636241 0.0000000 0.9226539 0.6904075 0.7748305 0.7656671 0.7887775
#> f 0.8946692 0.8736978 0.8736978 0.8736978 0.9226539 0.0000000 0.7374396 0.9189132 0.8929772 0.9557896
#> g 0.5415304 0.4943473 0.4943473 0.4943473 0.6904075 0.7374396 0.0000000 0.6968355 0.8281550 0.6265219
#> h 0.9616223 0.9592858 0.9592858 0.9592858 0.7748305 0.9189132 0.6968355 0.0000000 0.9292092 0.9614179
#> i 0.8162174 0.8106045 0.8106045 0.8106045 0.7656671 0.8929772 0.8281550 0.9292092 0.0000000 0.9003699
#> j 0.9280323 0.9318911 0.9318911 0.9318911 0.7887775 0.9557896 0.6265219 0.9614179 0.9003699 0.0000000
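One possible reading of the last step, as a rough sketch rather than a definitive implementation: for each vector, pick its n - 1 most similar neighbours from the cosine matrix, so that the vector plus its neighbours forms a candidate group of n. The names `vecs`, `cs`, `k`, and `top` are made up for this illustration, and the matrix is rebuilt from a list so the snippet runs on its own.

```r
# Example vectors from the question, stored in a named list
vecs <- list(
  a = c(1,2,3,4,5), b = c(1,2,3,4,6), c = c(1,2,3,4,6), d = c(1,2,3,4,6),
  e = c(6,2,9,7,3), f = c(7,3,5,7,6), g = c(7,9,3,4,0), h = c(4,6,4,6,9),
  i = c(8,8,5,3,8), j = c(2,1,1,2,3)
)
v <- do.call(cbind, vecs)                        # one vector per column

# Cosine similarity of every column pair
cs <- crossprod(v) / sqrt(tcrossprod(colSums(v^2)))
diag(cs) <- 0                                    # ignore self-similarity

n <- 3                                           # desired group size
k <- n - 1                                       # neighbours needed per vector (assumption)

# For each column, the names of its k most similar vectors
top <- apply(cs, 2, function(col) {
  rownames(cs)[order(col, decreasing = TRUE)[seq_len(k)]]
})
top[, "b"]                                       # the two nearest neighbours of b
```

Note that this gives one candidate group per vector, not a ranking over all possible groups, and ties (such as b, c, d all being equally close to a) are broken by position.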
CodePudding user response:
As you said, you can already compute the correlation between 2 vectors. Store the correlation for every pair of your input vectors; for N vectors that is O(N^2) operations. Once you have a score for every pair, you can form every set of three vectors, average the three pairwise scores within each set, and rank the sets by that average. For example, the set (a, b, c) contains three pairs: (a, b), (a, c), and (b, c). Take the correlation scores of those pairs, average them, and store the averages sorted in descending order so the most similar set comes first. That will be your result.
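A minimal sketch of this idea in base R, assuming Pearson correlation from `cor` as the pairwise score and the plain mean of the pairwise scores as the group score. The names `vecs`, `sim`, `groups`, and `result` are invented for the example, and `combn` enumerates every group of n vectors.

```r
# Example vectors from the question, stored in a named list
vecs <- list(
  a = c(1,2,3,4,5), b = c(1,2,3,4,6), c = c(1,2,3,4,6), d = c(1,2,3,4,6),
  e = c(6,2,9,7,3), f = c(7,3,5,7,6), g = c(7,9,3,4,0), h = c(4,6,4,6,9),
  i = c(8,8,5,3,8), j = c(2,1,1,2,3)
)
m <- do.call(cbind, vecs)          # one vector per column
sim <- cor(m)                      # pairwise Pearson correlation matrix

n <- 3                             # desired group size
groups <- combn(colnames(m), n)    # every n-subset of names, one per column

# Group score = mean of the pairwise similarities inside the group
score <- apply(groups, 2, function(g) {
  s <- sim[g, g]
  mean(s[upper.tri(s)])
})

result <- data.frame(
  vectors = apply(groups, 2, paste, collapse = ", "),
  similarity_score = score,
  stringsAsFactors = FALSE
)
result <- result[order(-result$similarity_score), ]   # most similar first
result$similarity_rank <- seq_len(nrow(result))
rownames(result) <- NULL
head(result)
```

With the example data, the group b, c, d comes out on top because those three vectors are identical. The number of groups is choose(N, n), so enumerating all of them is only practical for small N and n.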