I am looking to create a distance matrix for any arbitrary non-standard distance function.
I can do this the slow way as follows:
set.seed(1000)
DF <- data.frame(x=rnorm(10),y=rnorm(10)) # ten random points on the x y plane
L <- dim(DF)[1] # length of DF
F <- function(P1,P2){sqrt((P2$x-P1$x)^2 + (P2$y-P1$y)^2 + 1)}
# Almost the euclidean distance but with an added 1 to make it nonstandard
M <- matrix(nrow=L,ncol=L)
# Find the distances between every point in DF and every other point in DF
for(i in 1:L){
for(j in 1:L){
M[i,j] <- F(DF[i,],DF[j,])
}
}
M
which gives:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1.000000 1.326971 1.566994 1.708761 1.114078 1.527042 1.514868 1.836636 1.521510 1.813663
[2,] 1.326971 1.000000 1.735444 2.143117 1.336652 1.482555 1.427014 2.245816 2.153173 1.271712
[3,] 1.566994 1.735444 1.000000 1.190212 1.951701 1.088288 1.126241 1.212367 2.388228 1.734505
[4,] 1.708761 2.143117 1.190212 1.000000 2.123664 1.461169 1.523137 1.013764 2.267420 2.271950
[5,] 1.114078 1.336652 1.951701 2.123664 1.000000 1.851806 1.822077 2.263007 1.447333 1.934958
[6,] 1.527042 1.482555 1.088288 1.461169 1.851806 1.000000 1.004188 1.497537 2.459305 1.406153
[7,] 1.514868 1.427014 1.126241 1.523137 1.822077 1.004188 1.000000 1.564111 2.460997 1.344779
[8,] 1.836636 2.245816 1.212367 1.013764 2.263007 1.497537 1.564111 1.000000 2.415824 2.327128
[9,] 1.521510 2.153173 2.388228 2.267420 1.447333 2.459305 2.460997 2.415824 1.000000 2.818048
[10,] 1.813663 1.271712 1.734505 2.271950 1.934958 1.406153 1.344779 2.327128 2.818048 1.000000
Obviously, with 2 nested for loops in R, this will be very slow for datasets of any size.
I would like to speed this up by using a function such as mapply() or outer(), but I am unsure how to do it.
I've had a good look for similar questions, but I can't find one that gives an adequate answer without involving Rcpp.
Create a distance matrix in R using parallelization
Create custom distance matrix function in R
Speed Up Distance Calculations
Trying the advice given in this link below gives me:
pairwise comparison with all vectors of a list
outer(DF,DF,FUN=Vectorize(F))
Error: $ operator is invalid for atomic vectors
or
outer(DF,DF,FUN=F)
Error in dim(robj) <- c(dX, dY) :
dims [product 4] do not match the length of object [10]
CodePudding user response:
Here is how to use outer() to replace the nested loop while keeping the custom distance function:
set.seed(1000)
DF <- data.frame(x=rnorm(10),y=rnorm(10))
L <- dim(DF)[1]
F <- function(P1,P2){sqrt((P2$x-P1$x)^2 + (P2$y-P1$y)^2 + 1)}
M <- matrix(nrow=L,ncol=L)
outer(1:L, 1:L, FUN=function(x, y) F(DF[x,], DF[y,]))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1.000000 1.326971 1.566994 1.708761 1.114078 1.527042 1.514868 1.836636 1.521510 1.813663
[2,] 1.326971 1.000000 1.735444 2.143117 1.336652 1.482555 1.427014 2.245816 2.153173 1.271712
[3,] 1.566994 1.735444 1.000000 1.190212 1.951701 1.088288 1.126241 1.212367 2.388228 1.734505
[4,] 1.708761 2.143117 1.190212 1.000000 2.123664 1.461169 1.523137 1.013764 2.267420 2.271950
[5,] 1.114078 1.336652 1.951701 2.123664 1.000000 1.851806 1.822077 2.263007 1.447333 1.934958
[6,] 1.527042 1.482555 1.088288 1.461169 1.851806 1.000000 1.004188 1.497537 2.459305 1.406153
[7,] 1.514868 1.427014 1.126241 1.523137 1.822077 1.004188 1.000000 1.564111 2.460997 1.344779
[8,] 1.836636 2.245816 1.212367 1.013764 2.263007 1.497537 1.564111 1.000000 2.415824 2.327128
[9,] 1.521510 2.153173 2.388228 2.267420 1.447333 2.459305 2.460997 2.415824 1.000000 2.818048
[10,] 1.813663 1.271712 1.734505 2.271950 1.934958 1.406153 1.344779 2.327128 2.818048 1.000000
Benchmark with DF <- data.frame(x=rnorm(100),y=rnorm(100))
100x100
Unit: milliseconds
expr min lq mean median uq max neval
loop 647.080268 681.283754 720.842738 695.972994 728.078378 1057.16015 100
outer 7.892903 8.145765 8.661221 8.307392 8.710785 14.07253 100
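The reason this works is that outer() expands the two index vectors to all pairs and calls FUN once on the full vectors, so FUN has to be vectorised. As a rough sketch of the same idea without data frame row indexing, the distance can be written on plain numeric vectors. The names F_vec, idx and M_vec are made up for this illustration and are not part of the answer above.

```r
set.seed(1000)
DF <- data.frame(x = rnorm(10), y = rnorm(10))

# The custom distance from the question, written on plain numeric vectors
# so that it is vectorised over pairs of coordinates (illustrative name).
F_vec <- function(x1, y1, x2, y2) sqrt((x2 - x1)^2 + (y2 - y1)^2 + 1)

idx <- seq_len(nrow(DF))

# outer() calls FUN a single time with the expanded index vectors i and j,
# then reshapes the result into an nrow(DF) x nrow(DF) matrix.
M_vec <- outer(idx, idx,
               FUN = function(i, j) F_vec(DF$x[i], DF$y[i], DF$x[j], DF$y[j]))
M_vec
```

Indexing plain vectors is usually cheaper than subsetting data frame rows, but whether that matters here has not been benchmarked.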
CodePudding user response:
You can use a nice simple method included in base R to calculate distances in dataframes of points (2D or 3D)
dist(DF, method = "euclidean", diag =TRUE, upper = TRUE)
If you only want the lower triangle, leave out upper=TRUE, and if you do not want to see the zero values on the diagonal, set diag=FALSE.
This function can also compute manhattan, minkowski and canberra distances. Super simple.
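For the specific function in the question, which is the Euclidean distance with 1 added under the square root, the result of dist() can be transformed afterwards. This is only a sketch for that one case and does not generalise to an arbitrary distance function. The names D and M_dist are made up for the example.

```r
set.seed(1000)
DF <- data.frame(x = rnorm(10), y = rnorm(10))

# Plain Euclidean distances as a full square matrix (zeros on the diagonal)
D <- as.matrix(dist(DF, method = "euclidean"))

# sqrt(dx^2 + dy^2 + 1) is the same as sqrt(euclidean^2 + 1)
M_dist <- sqrt(D^2 + 1)
M_dist
```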
Now that I understand what you want: there is an R package called usedist that offers some methods for defining matrices and functions for applying distance measures.
It has a function dist_make() which applies a function to each pair of rows in a matrix (not a data frame).
You will need to rework your function so that it operates on rows of a matrix rather than rows of a data frame.
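A minimal sketch of that retooling, assuming the data frame is converted with as.matrix() so each row becomes a plain numeric vector. F_row and pairwise_rows are hypothetical names; pairwise_rows is a base R stand-in that only illustrates the "apply a function to each pair of rows" idea. The usedist call is guarded because the package may not be installed.

```r
set.seed(1000)
DF <- data.frame(x = rnorm(10), y = rnorm(10))
X <- as.matrix(DF)  # each row is now a numeric vector c(x, y)

# The custom distance rewritten to take two numeric vectors (illustrative name)
F_row <- function(p1, p2) sqrt(sum((p2 - p1)^2) + 1)

# Hypothetical base R helper: apply f to every pair of rows of X
pairwise_rows <- function(X, f) {
  n <- nrow(X)
  M <- matrix(0, n, n)
  for (i in seq_len(n)) for (j in seq_len(n)) M[i, j] <- f(X[i, ], X[j, ])
  M
}
M_rows <- pairwise_rows(X, F_row)

# With usedist installed, the same function can be passed to dist_make()
if (requireNamespace("usedist", quietly = TRUE)) {
  D_usedist <- usedist::dist_make(X, F_row)
  print(D_usedist)
}
```

Note that a dist object does not store the diagonal, so converting it with as.matrix() shows 0 there rather than the 1 this function returns for identical points.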