Big data ways to calculate sets of distances in R?

Problem: We need a big-data method for calculating distances between points. We outline what we'd like to do below with a five-observation dataframe. However, this particular method is infeasible as the number of rows gets large (> 1 million). In the past we've used SAS to do this kind of analysis, but we'd prefer R if possible. (Note: we show only a small sketch of code below, because the method we outline for small datasets is basically impossible to use with data on our scale.)

We start with a dataframe of stores, each of which has a latitude and longitude (though this is not a spatial file, nor do we want to use a spatial file).

# you can think of x and y in this example as Cartesian coordinates
stores <- data.frame(id = 1:5,
                     x = c(1, 0, 1, 2, 0),
                     y = c(1, 2, 0, 2, 0))

stores
  id x y
1  1 1 1
2  2 0 2
3  3 1 0
4  4 2 2
5  5 0 0

For each store, we want to know the number of stores within x distance. In a small dataframe this is straightforward: create another dataframe of all coordinates, merge it back in, calculate the distances, create an indicator for each distance less than x, and add up the indicators (minus one for the store itself, which is at distance 0; a sketch of this appears after the final table below). This would result in a dataset that looks like this:

   id x y  s1.dist  s2.dist  s3.dist  s4.dist  s5.dist
1:  1 1 1 0.000000 1.414214 1.000000 1.414214 1.414214
2:  2 0 2 1.414214 0.000000 2.236068 2.000000 2.000000
3:  3 1 0 1.000000 2.236068 0.000000 2.236068 1.000000
4:  4 2 2 1.414214 2.000000 2.236068 0.000000 2.828427
5:  5 0 0 1.414214 2.000000 1.000000 2.828427 0.000000

If you (arbitrarily) count distances under 1.45 as "close," you end up with indicators that look like this:

# don't include the store itself in the total
   id x y s1.close s2.close s3.close s4.close s5.close total.close
1:  1 1 1        1        1        1        1        1           4
2:  2 0 2        1        1        0        0        0           1
3:  3 1 0        1        0        1        0        1           2
4:  4 2 2        1        0        0        1        0           1
5:  5 0 0        1        0        1        0        1           2

The final product should look like this:

   id total.close
1:  1           4
2:  2           1
3:  3           2
4:  4           1
5:  5           2
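
For reference, a minimal sketch of that approach on the toy data (this is the same all-pairs method that does not scale, since outer() materializes the full n-by-n distance matrix):

# all-pairs Euclidean distances via outer(); memory grows as O(n^2)
dmat <- sqrt(outer(stores$x, stores$x, "-")^2 +
             outer(stores$y, stores$y, "-")^2)
# flag "close" pairs and subtract 1 for each store's zero self-distance
stores$total.close <- rowSums(dmat < 1.45) - 1
stores[, c("id", "total.close")]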

All advice appreciated.

Thank you very much

CodePudding user response:

Any reason you can't loop instead of making it one big calculation?

stores <- data.frame(id = 1:5,
                     x = c(1, 0, 1, 2, 0),
                     y = c(1, 2, 0, 2, 0))

# Here's a Euclidean distance metric, but you can drop anything you want in here
distfun <- function(x0, y0, x1, y1){
  sqrt((x1-x0)^2 + (y1-y0)^2)
}

# Loop over each store
t(sapply(seq_len(nrow(stores)), function(i){
  distances <- distfun(x0 = stores$x[i], x1 = stores$x,
                       y0 = stores$y[i], y1 = stores$y)
  # Calculate number less than arbitrary cutoff, subtract one for self
  num_within <- sum(distances<1.45)-1
  c(stores$id[i], num_within)
}))

Produces:

     [,1] [,2]
[1,]    1    4
[2,]    2    1
[3,]    3    2
[4,]    4    1
[5,]    5    2

This will work with a data set of any size that you can bring into R, but it'll just get slower as the size increases. Here's a test on 10,000 entries that runs in a couple of seconds on my machine:

stores <- data.frame(id = 1:10000,
                     x = runif(10000, max = 10),
                     y = runif(10000, max = 10))

# re-running the t(sapply(...)) loop above on this data gives:
          [,1] [,2]
    [1,]     1  679
    [2,]     2  698
    [3,]     3  618
    [4,]     4  434
    [5,]     5  402
...
 [9995,]  9995  529
 [9996,]  9996  626
 [9997,]  9997  649
 [9998,]  9998  514
 [9999,]  9999  667
[10000,] 10000  603

It gets slower with more data (because it has to compute the distance between every pair of points, this will always be O(n^2)), but without knowing the actual distance metric you'd like to calculate, we can't optimize the slow part any further.
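
That said, if the metric really is plain Euclidean distance with a fixed cutoff, a fixed-radius neighbor search can sidestep the all-pairs comparison. A sketch, assuming the dbscan package is available (its frNN() builds a kd-tree and does not return a point as its own neighbor):

library(dbscan)

# all neighbors within radius 1.45 of each point; self-matches excluded
pts <- as.matrix(stores[, c("x", "y")])
nn <- frNN(pts, eps = 1.45)
data.frame(id = stores$id, total.close = lengths(nn$id))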

CodePudding user response:

Have you already tried the classic dist() function? The core is implemented in C and should thus be fast.

The coercion to a matrix (which takes place inside dist() anyway) probably already costs a lot of time on its own; if possible, read the data in immediately as a matrix rather than as a data frame first.
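
If the data lived in a file, a direct-to-matrix read might look like this (hypothetical file name and columns; assumes the data.table package):

# read only the coordinate columns and convert straight to a matrix
M <- as.matrix(data.table::fread("stores.csv", select = c("x", "y")))

For the toy data in this question, a plain conversion suffices: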

M <- as.matrix(stores[-1])

dist(M, diag=TRUE, upper=TRUE)
#          1        2        3        4        5
# 1 0.000000 1.414214 1.000000 1.414214 1.414214
# 2 1.414214 0.000000 2.236068 2.000000 2.000000
# 3 1.000000 2.236068 0.000000 2.236068 1.000000
# 4 1.414214 2.000000 2.236068 0.000000 2.828427
# 5 1.414214 2.000000 1.000000 2.828427 0.000000
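
From there, the question's final product can be sketched directly (same arbitrary 1.45 cutoff; note that as.matrix() expands the result back to the full n-by-n matrix, so memory is again the constraint at scale):

D <- as.matrix(dist(M))
# count distances under the cutoff, minus 1 for the zero self-distance
data.frame(id = stores$id, total.close = rowSums(D < 1.45) - 1)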

Otherwise you could try this C++ implementation, which is basically a copy of @coatless's code; I used the Rcpp package so it can be called from an R script.

library(Rcpp)
cppFunction('Rcpp::NumericMatrix calcPWD1 (const Rcpp::NumericMatrix & x){
  unsigned int outrows = x.nrow(), i = 0, j = 0;
  double d;
  Rcpp::NumericMatrix out(outrows,outrows);

  for (i = 0; i < outrows - 1; i++){
    Rcpp::NumericVector v1 = x.row(i);
    for (j = i + 1; j < outrows; j++){
      d = sqrt(sum(pow(v1-x.row(j), 2.0)));
      out(j,i)=d;
      out(i,j)=d;
    }
  }

  return out;
}')

calcPWD1(M)
#          [,1]     [,2]     [,3]     [,4]     [,5]
# [1,] 0.000000 1.414214 1.000000 1.414214 1.414214
# [2,] 1.414214 0.000000 2.236068 2.000000 2.000000
# [3,] 1.000000 2.236068 0.000000 2.236068 1.000000
# [4,] 1.414214 2.000000 2.236068 0.000000 2.828427
# [5,] 1.414214 2.000000 1.000000 2.828427 0.000000

However, the benchmark is still clearly in favor of dist(), so you should give it a try:

M_big <- M[sample(nrow(M), 1e4, replace=TRUE), ]  ## inflate to 10k rows
microbenchmark::microbenchmark(
  dist=dist(M_big, diag=TRUE, upper=TRUE),
  calcPWD1=calcPWD1(M_big),
  control=list(warmup=10L),
  times=3L
)
# Unit: milliseconds
#     expr       min        lq     mean   median        uq       max neval cld
#     dist  640.1861  660.1396  765.881  680.093  828.7284  977.3638     3  a 
# calcPWD1 1419.4106 1439.1353 1505.253 1458.860 1548.1736 1637.4873     3   b

Be sure to read @coatless's and Dirk Eddelbuettel's answers, where they write some more about C, C++, and R and have other versions of the function.
