Problem: We need a big-data method for calculating distances between points. We outline what we'd like to do below with a five-observation dataframe, but this particular method becomes infeasible as the number of rows gets large (> 1 million). In the past, we've used SAS for this kind of analysis, but we'd prefer R if possible. (Note: the small-data approach sketched below is essentially impossible to use with data at our scale.)
We start with a dataframe of stores, each of which has a latitude and longitude (though this is not a spatial file, nor do we want to use a spatial file).
# you can think of x and y in this example as Cartesian coordinates
stores <- data.frame(id = 1:5,
x = c(1, 0, 1, 2, 0),
y = c(1, 2, 0, 2, 0))
stores
id x y
1 1 1 1
2 2 0 2
3 3 1 0
4 4 2 2
5 5 0 0
For each store, we want to know the number of other stores within a given distance. In a small dataframe, this is straightforward: create another dataframe of all coordinates, merge it back in, calculate the distances, create an indicator for whether each distance is under the cutoff, and add up the indicators (minus one for the store itself, which is at distance 0). A minimal sketch of this approach is shown after the tables below. It would result in a dataset that looks like this:
id x y s1.dist s2.dist s3.dist s4.dist s5.dist
1: 1 1 1 0.000000 1.414214 1.000000 1.414214 1.414214
2: 2 0 2 1.414214 0.000000 2.236068 2.000000 2.000000
3: 3 1 0 1.000000 2.236068 0.000000 2.236068 1.000000
4: 4 2 2 1.414214 2.000000 2.236068 0.000000 2.828427
5: 5 0 0 1.414214 2.000000 1.000000 2.828427 0.000000
When you (arbitrarily) count distances under 1.45 as "close," you end up with indicators that look like this:
# don't include the store itself in the total
id x y s1.close s2.close s3.close s4.close s5.close total.close
1: 1 1 1 1 1 1 1 1 4
2: 2 0 2 1 1 0 0 0 1
3: 3 1 0 1 0 1 0 1 2
4: 4 2 2 1 0 0 1 0 1
5: 5 0 0 1 0 1 0 1 2
The final product should look like this:
id total.close
1: 1 4
2: 2 1
3: 3 2
4: 4 1
5: 5 2
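For reference, one way to write this small-data approach (a sketch using data.table purely for illustration; the 1.45 cutoff and the total.close name come from the tables above):
library(data.table)
st <- as.data.table(stores)
# cross join every store with every other store and attach both sets of coordinates
pairs <- CJ(id = st$id, id2 = st$id)
pairs <- merge(pairs, st, by = "id")                 # coordinates of the first store
pairs <- merge(pairs, st, by.x = "id2", by.y = "id",
               suffixes = c("", ".2"))               # coordinates of the second store
# Euclidean distance per pair, then count "close" pairs, minus one for the store itself
pairs[, dist := sqrt((x - x.2)^2 + (y - y.2)^2)]
pairs[, .(total.close = sum(dist < 1.45) - 1), by = id][order(id)]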
All advice appreciated.
Thank you very much
CodePudding user response:
Any reason you can't loop instead of making it one big calculation?
stores <- data.frame(id = 1:5,
x = c(1, 0, 1, 2, 0),
y = c(1, 2, 0, 2, 0))
# Here's a Euclidean distance metric, but you can drop anything you want in here
distfun <- function(x0, y0, x1, y1){
sqrt((x1-x0)^2 + (y1-y0)^2)
}
# Loop over each store
t(sapply(seq_len(nrow(stores)), function(i){
distances <- distfun(x0 = stores$x[i], x1 = stores$x,
y0 = stores$y[i], y1 = stores$y)
# Calculate number less than arbitrary cutoff, subtract one for self
num_within <- sum(distances<1.45)-1
c(stores$id[i], num_within)
}))
Produces:
[,1] [,2]
[1,] 1 4
[2,] 2 1
[3,] 3 2
[4,] 4 1
[5,] 5 2
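If you want the two-column shape from the question, you can relabel the result (a sketch, assuming the matrix printed above has been assigned to res):
# assumes the matrix returned by the loop above was assigned to res
out <- setNames(as.data.frame(res), c("id", "total.close"))
out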
This will work with a data set of any size that you can bring into R, but it'll just get slower as the size increases. Here's a test on 10,000 entries that runs in a couple seconds on my machine:
stores <- data.frame(id=1:10000,
x=runif(10000, max = 10),
y=runif(10000, max = 10))
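# then re-run the sapply() loop above; runif() is not seeded here, so exact counts will vary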
[,1] [,2]
[1,] 1 679
[2,] 2 698
[3,] 3 618
[4,] 4 434
[5,] 5 402
...
[9995,] 9995 529
[9996,] 9996 626
[9997,] 9997 649
[9998,] 9998 514
[9999,] 9999 667
[10000,] 10000 603
It gets slower with more calculations (because it has to run between every pair of points, this will always be O(n^2)), but without knowing the actual distance metric you'd like to calculate, we can't optimize the slow part any further.
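For example, if you wanted a Manhattan (city-block) metric instead (just an illustration of swapping in another metric), only distfun changes:
# Manhattan distance instead of Euclidean
distfun <- function(x0, y0, x1, y1){
  abs(x1 - x0) + abs(y1 - y0)
}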
CodePudding user response:
Did you already try the classic dist() function? Its core is implemented in C and should thus be fast.
The coercion to a matrix (which happens inside dist() anyway) probably already costs a lot of time; maybe the data could be read in directly as a matrix rather than first as a data frame.
M <- as.matrix(stores[-1])
dist(M, diag=TRUE, upper=TRUE)
# 1 2 3 4 5
# 1 0.000000 1.414214 1.000000 1.414214 1.414214
# 2 1.414214 0.000000 2.236068 2.000000 2.000000
# 3 1.000000 2.236068 0.000000 2.236068 1.000000
# 4 1.414214 2.000000 2.236068 0.000000 2.828427
# 5 1.414214 2.000000 1.000000 2.828427 0.000000
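To get the per-store counts the question asks for, one option (a sketch, using the question's 1.45 cutoff) is to convert the dist object to a full matrix and count per row:
D <- as.matrix(dist(M))
# neighbours closer than the cutoff, minus one for the store itself
data.frame(id = stores$id, total.close = rowSums(D < 1.45) - 1)
This reproduces the total.close column from the question (4, 1, 2, 1, 2).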
Otherwise you could try this C++ implementation, which is basically a copy of @coatless's code. However, I used the Rcpp package so it can be called from an R script.
library(Rcpp)
cppFunction('Rcpp::NumericMatrix calcPWD1 (const Rcpp::NumericMatrix & x){
unsigned int outrows = x.nrow(), i = 0, j = 0;
double d;
Rcpp::NumericMatrix out(outrows,outrows);
for (i = 0; i < outrows - 1; i++){
Rcpp::NumericVector v1 = x.row(i);
for (j = i + 1; j < outrows; j++){
d = sqrt(sum(pow(v1-x.row(j), 2.0)));
out(j,i)=d;
out(i,j)=d;
}
}
return out;
}')
calcPWD1(M)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.000000 1.414214 1.000000 1.414214 1.414214
# [2,] 1.414214 0.000000 2.236068 2.000000 2.000000
# [3,] 1.000000 2.236068 0.000000 2.236068 1.000000
# [4,] 1.414214 2.000000 2.236068 0.000000 2.828427
# [5,] 1.414214 2.000000 1.000000 2.828427 0.000000
However, the benchmark is still clearly in favor of dist(), so you should give it a try:
M_big <- M[sample(nrow(M), 1e4, replace=TRUE), ] ## inflate to 10k rows
microbenchmark::microbenchmark(
dist=dist(M_big, diag=TRUE, upper=TRUE),
calcPWD1=calcPWD1(M_big),
control=list(warmup=10L),
times=3L
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dist 640.1861 660.1396 765.881 680.093 828.7284 977.3638 3 a
# calcPWD1 1419.4106 1439.1353 1505.253 1458.860 1548.1736 1637.4873 3 b
Be sure to read @coatless's and Dirk Eddelbuettel's answers, where they write some more about C, C++, and R and have other versions of the function.