R: creating a distance matrix quickly (like using mapply() or similar)

Time:10-10

I am looking to create a distance matrix for any arbitrary non-standard distance function.

I can do this the slow way as follows:

set.seed(1000)
DF <- data.frame(x=rnorm(10),y=rnorm(10))       # ten random points on the x y plane
L <- dim(DF)[1]     # number of points (rows) in DF
F <- function(P1,P2){sqrt((P2$x-P1$x)^2 + (P2$y-P1$y)^2 + 1)}
# Almost the Euclidean distance, but with an added 1 to make it nonstandard

M <- matrix(nrow=L,ncol=L)

# Find the distances between every point in DF and every other point in DF
for(i in 1:L){
    for(j in 1:L){
        M[i,j] <- F(DF[i,],DF[j,])
    }
}

M

which gives:

      [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]     [,9]    [,10]
 [1,] 1.000000 1.326971 1.566994 1.708761 1.114078 1.527042 1.514868 1.836636 1.521510 1.813663
 [2,] 1.326971 1.000000 1.735444 2.143117 1.336652 1.482555 1.427014 2.245816 2.153173 1.271712
 [3,] 1.566994 1.735444 1.000000 1.190212 1.951701 1.088288 1.126241 1.212367 2.388228 1.734505
 [4,] 1.708761 2.143117 1.190212 1.000000 2.123664 1.461169 1.523137 1.013764 2.267420 2.271950
 [5,] 1.114078 1.336652 1.951701 2.123664 1.000000 1.851806 1.822077 2.263007 1.447333 1.934958
 [6,] 1.527042 1.482555 1.088288 1.461169 1.851806 1.000000 1.004188 1.497537 2.459305 1.406153
 [7,] 1.514868 1.427014 1.126241 1.523137 1.822077 1.004188 1.000000 1.564111 2.460997 1.344779
 [8,] 1.836636 2.245816 1.212367 1.013764 2.263007 1.497537 1.564111 1.000000 2.415824 2.327128
 [9,] 1.521510 2.153173 2.388228 2.267420 1.447333 2.459305 2.460997 2.415824 1.000000 2.818048
[10,] 1.813663 1.271712 1.734505 2.271950 1.934958 1.406153 1.344779 2.327128 2.818048 1.000000

With two nested for loops in R, this becomes very slow as the dataset grows. I would like to speed it up with a function such as mapply() or outer(), but I am unsure how to do it.

I've had a good look for similar questions, but I can't find one that gives an adequate answer without involving Rcpp:

Create a distance matrix in R using parallelization

Create custom distance matrix function in R

Speed Up Distance Calculations

Trying the advice given in the question linked below gives me:

pairwise comparison with all vectors of a list

outer(DF,DF,FUN=Vectorize(F))
Error: $ operator is invalid for atomic vectors

or

outer(DF,DF,FUN=F)
Error in dim(robj) <- c(dX, dY) : 
dims [product 4] do not match the length of object [10]

CodePudding user response:

Here is how to use outer() to replace the nested loop while keeping the custom distance function:

set.seed(1000)
DF <- data.frame(x=rnorm(10),y=rnorm(10))       
L <- dim(DF)[1]     
F <- function(P1,P2){sqrt((P2$x-P1$x)^2 + (P2$y-P1$y)^2 + 1)}
M <- matrix(nrow=L,ncol=L)

outer(1:L, 1:L, FUN=function(x, y) F(DF[x,], DF[y,]))
          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]     [,9]    [,10]
 [1,] 1.000000 1.326971 1.566994 1.708761 1.114078 1.527042 1.514868 1.836636 1.521510 1.813663
 [2,] 1.326971 1.000000 1.735444 2.143117 1.336652 1.482555 1.427014 2.245816 2.153173 1.271712
 [3,] 1.566994 1.735444 1.000000 1.190212 1.951701 1.088288 1.126241 1.212367 2.388228 1.734505
 [4,] 1.708761 2.143117 1.190212 1.000000 2.123664 1.461169 1.523137 1.013764 2.267420 2.271950
 [5,] 1.114078 1.336652 1.951701 2.123664 1.000000 1.851806 1.822077 2.263007 1.447333 1.934958
 [6,] 1.527042 1.482555 1.088288 1.461169 1.851806 1.000000 1.004188 1.497537 2.459305 1.406153
 [7,] 1.514868 1.427014 1.126241 1.523137 1.822077 1.004188 1.000000 1.564111 2.460997 1.344779
 [8,] 1.836636 2.245816 1.212367 1.013764 2.263007 1.497537 1.564111 1.000000 2.415824 2.327128
 [9,] 1.521510 2.153173 2.388228 2.267420 1.447333 2.459305 2.460997 2.415824 1.000000 2.818048
[10,] 1.813663 1.271712 1.734505 2.271950 1.934958 1.406153 1.344779 2.327128 2.818048 1.000000

Benchmark with DF <- data.frame(x=rnorm(100),y=rnorm(100)), which gives a 100x100 matrix:

Unit: milliseconds
  expr        min         lq       mean     median         uq        max neval
  loop 647.080268 681.283754 720.842738 695.972994 728.078378 1057.16015   100
 outer   7.892903   8.145765   8.661221   8.307392   8.710785   14.07253   100
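
One detail that makes the outer() call above work: outer() does not call FUN once per pair. It calls FUN a single time with two index vectors of length L*L, so FUN has to be vectorised. The sketch below shows the same idea indexing the columns directly, plus a fallback for distance functions that cannot be vectorised. The names dist_fun, M_outer, M_vec and M_loop are made up for illustration, and neither variant has been benchmarked here.

```r
set.seed(1000)
DF <- data.frame(x = rnorm(10), y = rnorm(10))
L <- nrow(DF)
F <- function(P1, P2) sqrt((P2$x - P1$x)^2 + (P2$y - P1$y)^2 + 1)

# Hypothetical helper: same formula as F, but taking index vectors i and j
dist_fun <- function(i, j) {
  sqrt((DF$x[j] - DF$x[i])^2 + (DF$y[j] - DF$y[i])^2 + 1)
}

# outer() calls dist_fun once, with two vectors of length L * L
M_outer <- outer(seq_len(L), seq_len(L), FUN = dist_fun)

# Fallback for a function that only handles one pair at a time:
# Vectorize() wraps it in mapply(), so it is called once per pair
M_vec <- outer(seq_len(L), seq_len(L),
               FUN = Vectorize(function(i, j) F(DF[i, ], DF[j, ])))

# Reference: the original nested loop
M_loop <- matrix(nrow = L, ncol = L)
for (i in seq_len(L)) {
  for (j in seq_len(L)) {
    M_loop[i, j] <- F(DF[i, ], DF[j, ])
  }
}

all.equal(M_outer, M_loop)
all.equal(M_vec, M_loop)
```

The Vectorize() variant still evaluates the function L*L times, so it is mainly a convenience; the speed-up comes from a function that handles whole vectors at once.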

CodePudding user response:

Base R includes a simple function, dist(), for calculating distances between the rows of a data frame of points (2D or 3D):

dist(DF, method = "euclidean", diag =TRUE, upper = TRUE) 

If you only want the lower triangle, leave out upper=TRUE, and if you do not want to see the zero values on the diagonal, set diag=FALSE.

This function can also compute Manhattan, Minkowski and Canberra distances. Super simple.
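
For the particular function in the question, dist() can still do the heavy lifting, because that function is just sqrt(d^2 + 1) where d is the Euclidean distance. A minimal sketch, assuming that relationship holds (it does not generalise to an arbitrary distance function); the names D and M_dist are chosen for illustration:

```r
set.seed(1000)
DF <- data.frame(x = rnorm(10), y = rnorm(10))

# Full symmetric matrix of Euclidean distances, with zeros on the diagonal
D <- as.matrix(dist(DF, method = "euclidean"))

# The question's function is sqrt(d^2 + 1), so transform element-wise
M_dist <- sqrt(D^2 + 1)

M_dist[1:3, 1:3]
```

Note that as.matrix() labels the rows and columns with the row names of DF, unlike the matrix built by the loop.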

Now that I understand what you want: there is an R package called usedist that offers functions for building distance matrices and applying custom distance measures.

It has a function dist_make(), which applies a function to each pair of rows in a matrix (not a data frame).

You will need to rework your function so that it accepts two rows of a matrix of your data.
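
A rough sketch of what that retooling might look like, assuming the usedist package is installed and that dist_make() takes a matrix followed by a two-argument function, as its documentation describes. F_vec and M_usedist are hypothetical names introduced here:

```r
# Assumes the usedist package is installed: install.packages("usedist")
set.seed(1000)
DF <- data.frame(x = rnorm(10), y = rnorm(10))

# Retooled distance function: takes two numeric vectors (rows of a matrix)
F_vec <- function(p1, p2) sqrt(sum((p2 - p1)^2) + 1)

if (requireNamespace("usedist", quietly = TRUE)) {
  # dist_make() returns an object of class "dist"
  d <- usedist::dist_make(as.matrix(DF), F_vec)
  M_usedist <- as.matrix(d)
  # A "dist" object stores no diagonal, so as.matrix() fills it with 0;
  # set it by hand, because the function gives 1 for a point against itself
  diag(M_usedist) <- 1
  print(M_usedist[1:3, 1:3])
}
```

The diagonal step matters only for nonstandard functions like this one, where the distance from a point to itself is not zero.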

Here is the documentation
