What method to find n most similar/related vectors within N numeric vectors in R? (N > n)

Time:10-24

I have these 10 numeric vectors. For simplicity, each contains 5 elements:

a <- c(1,2,3,4,5)
b <- c(1,2,3,4,6)
c <- c(1,2,3,4,6)
d <- c(1,2,3,4,6)
e <- c(6,2,9,7,3)
f <- c(7,3,5,7,6)
g <- c(7,9,3,4,0)
h <- c(4,6,4,6,9)
i <- c(8,8,5,3,8)
j <- c(2,1,1,2,3)

I want to find the 3 most related/similar vectors. Here, that must be vectors b, c, d.

Additionally, I am also hoping to get other vector combinations besides the "most related" one (b, c, d). In this case, these could be (a, b, c), (a, b, d), (a, c, d). The level of relation/similarity should itself have a score, so I can find the most similar, second most similar, etc.

The expected output is something like this:

similarity_rank   vectors   similarity_score (example)
1                 b, c, d   0.99
2                 a, b, c   0.8
etc.

What I have tried so far: I'm using pairwise correlation. It measures the relation between vectors, but only between 2 vectors at a time. I want a "similarity score" for those 3 vectors (or, in general, for n vectors).

Rules:

  • n: Number of desired vectors
  • N: Number of all vectors
  • N > n
  • All vectors are numeric

Question: What is the best method to do that? (R code would be amazing, an R package would be great, or just the method name is enough so I can learn about it.)

CodePudding user response:

Put the vectors as the columns of a matrix, then calculate the cosine similarity of each pair of columns using crossprod, as in this answer. Then you can find the n largest values in each column.

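# Collect the vectors a..j as the columns of a 5 x 10 matrix
v <- mapply(get, letters[1:10], mode = "numeric")
# Cosine similarity of every column pair; the last factor zeroes the diagonal (self-similarity)
crossprod(v)/(sqrt(tcrossprod(colSums(v^2))))*(1 - diag(ncol(v)))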
#>           a         b         c         d         e         f         g         h         i         j
#> a 0.0000000 0.9958592 0.9958592 0.9958592 0.8062730 0.8946692 0.5415304 0.9616223 0.8162174 0.9280323
#> b 0.9958592 0.0000000 1.0000000 1.0000000 0.7636241 0.8736978 0.4943473 0.9592858 0.8106045 0.9318911
#> c 0.9958592 1.0000000 0.0000000 1.0000000 0.7636241 0.8736978 0.4943473 0.9592858 0.8106045 0.9318911
#> d 0.9958592 1.0000000 1.0000000 0.0000000 0.7636241 0.8736978 0.4943473 0.9592858 0.8106045 0.9318911
#> e 0.8062730 0.7636241 0.7636241 0.7636241 0.0000000 0.9226539 0.6904075 0.7748305 0.7656671 0.7887775
#> f 0.8946692 0.8736978 0.8736978 0.8736978 0.9226539 0.0000000 0.7374396 0.9189132 0.8929772 0.9557896
#> g 0.5415304 0.4943473 0.4943473 0.4943473 0.6904075 0.7374396 0.0000000 0.6968355 0.8281550 0.6265219
#> h 0.9616223 0.9592858 0.9592858 0.9592858 0.7748305 0.9189132 0.6968355 0.0000000 0.9292092 0.9614179
#> i 0.8162174 0.8106045 0.8106045 0.8106045 0.7656671 0.8929772 0.8281550 0.9292092 0.0000000 0.9003699
#> j 0.9280323 0.9318911 0.9318911 0.9318911 0.7887775 0.9557896 0.6265219 0.9614179 0.9003699 0.0000000
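
The last step, picking the n largest values in each column, is not shown above. A minimal sketch of it follows, assuming the same ten vectors; `sim`, `n_top` and `top` are names chosen here for illustration and are not part of the original answer.

```r
# The ten vectors from the question, one per column (5 x 10 matrix).
v <- cbind(
  a = c(1, 2, 3, 4, 5), b = c(1, 2, 3, 4, 6), c = c(1, 2, 3, 4, 6),
  d = c(1, 2, 3, 4, 6), e = c(6, 2, 9, 7, 3), f = c(7, 3, 5, 7, 6),
  g = c(7, 9, 3, 4, 0), h = c(4, 6, 4, 6, 9), i = c(8, 8, 5, 3, 8),
  j = c(2, 1, 1, 2, 3)
)

# Cosine similarity of every column pair, with the diagonal set to 0
# so that a vector is never reported as its own neighbour.
sim <- crossprod(v) / sqrt(tcrossprod(colSums(v^2))) * (1 - diag(ncol(v)))

# Assumption: we want the 2 nearest neighbours of each vector.
n_top <- 2

# For each column, sort the similarities in decreasing order and keep
# the names of the first n_top entries. The result is an n_top x 10
# character matrix with one column per vector.
top <- apply(sim, 2, function(col) {
  names(sort(col, decreasing = TRUE))[seq_len(n_top)]
})

top[, "b"]  # the two vectors most similar to b
```

Note that this yields the nearest neighbours of each individual vector, not a single score for a group of n vectors, and that tied similarities (b, c and d are identical here) are returned in an arbitrary but consistent order.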

CodePudding user response:

As you said, you can already compute the correlation between 2 vectors. Store the correlation of every pair of input vectors; this takes O(N^2) operations. Once you have a score for every pair, you can form every set of three vectors, average the scores of the three pairs in each set, and rank the sets by that average.

For example, the set (a, b, c) contains three pairs: (a, b), (a, c) and (b, c). Take the correlation scores of these pairs, average them, and store the averages sorted in descending order, so that the most similar set comes first. That will be your result.
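
The averaging approach described above could be sketched roughly as follows. This assumes Pearson correlation from `cor()` as the pairwise score and n = 3; the names `pair_score`, `groups`, `score` and `result` are hypothetical, chosen only for this illustration.

```r
# The ten vectors from the question, one per column (5 x 10 matrix).
v <- cbind(
  a = c(1, 2, 3, 4, 5), b = c(1, 2, 3, 4, 6), c = c(1, 2, 3, 4, 6),
  d = c(1, 2, 3, 4, 6), e = c(6, 2, 9, 7, 3), f = c(7, 3, 5, 7, 6),
  g = c(7, 9, 3, 4, 0), h = c(4, 6, 4, 6, 9), i = c(8, 8, 5, 3, 8),
  j = c(2, 1, 1, 2, 3)
)

n <- 3  # size of each group of vectors

# Pairwise score: Pearson correlation between every pair of columns.
pair_score <- cor(v)

# Every possible group of n vector names (an n x choose(N, n) matrix).
groups <- combn(colnames(v), n)

# Score of a group = mean of the pairwise scores inside the group.
score <- apply(groups, 2, function(g) {
  m <- pair_score[g, g]
  mean(m[upper.tri(m)])  # each pair counted once, diagonal excluded
})

result <- data.frame(
  vectors = apply(groups, 2, paste, collapse = ", "),
  similarity_score = score,
  stringsAsFactors = FALSE
)

# Most similar group first, then attach the rank.
result <- result[order(result$similarity_score, decreasing = TRUE), ]
result$similarity_rank <- seq_len(nrow(result))

head(result)
```

The number of groups is choose(N, n), so this brute-force enumeration is only practical for small N and n. Any other pairwise measure, such as the cosine similarity from the first answer, can be substituted for `cor(v)`.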
