Problem: I have a large data.table dt
with about a million x and y values. These x-y combinations stand for events on a 2-D plane. I know the dimensions of that plane (iwidth
, iheight
).
I want to create a matrix that is 0 everywhere except in the x-y value that that are listed in the data.table. At these spots the matrix value should be 1. In general this is easily done, but with a million x-y values to set, regular approaches are not suited.
Approach: Since not every x-y combination will be represented in the data.table, I first create a 0-matrix with the right dimensions. Then I replace the 0s at the points indicated by the data.table into 1s.
## initial setup (for easier testing we just use a data.frame, not a data.table)
iwidth = 4288
iheight = 8576
dt = data.frame( xval=sample(iwidth ,10), yval=sample(iheight ,10) )
## simple approach
mx = matrix(ncol=iwidth, nrow=iheight, data=0)
mx[dt$xval, dt$yval] = 1
## biganalytics approach
library(biganalytics)
mx = as.big.matrix(matrix(ncol=iwidth, nrow=iheight, data=0))
mx[dt$xval, dt$yval] = 1
Failure: For small data this works perfectly fine. However, when you actually have a data.table with a million rows, it takes forever. I thought the biganalytics
package might help, but this is true only for small data, while it is actually worse with large data (see benchmark below). I also tried apply
or with
but for me they did not work either (I think they should be even slower).
These are the microbenchmark results (with n=1) for the above approach (the dt5, dt50, etc. stand for a data.table with 5 rows, 50 rows, etc.). The time it takes increases extremely once we reach long data.tables (i.e. many values to replace in the matrix).
## Regular matrix:
Unit: milliseconds
expr min lq mean median uq max neval
dt5 130.8255 130.8255 130.8255 130.8255 130.8255 130.8255 1
dt50 87.2308 87.2308 87.2308 87.2308 87.2308 87.2308 1
dt500 86.7591 86.7591 86.7591 86.7591 86.7591 86.7591 1
dt5000 129.6120 129.6120 129.6120 129.6120 129.6120 129.6120 1
dt50000 4340.6080 4340.6080 4340.6080 4340.6080 4340.6080 4340.6080 1
## Biganalytics matrix:
Unit: milliseconds
expr min lq mean median uq max neval
dt5 0.988101 0.988101 0.988101 0.988101 0.988101 0.988101 1
dt50 0.779401 0.779401 0.779401 0.779401 0.779401 0.779401 1
dt500 9.814602 9.814602 9.814602 9.814602 9.814602 9.814602 1
dt5000 202.574901 202.574901 202.574901 202.574901 202.574901 202.574901 1
dt50000 19939.191600 19939.191600 19939.191600 19939.191600 19939.191600 19939.191600 1
CodePudding user response:
The following Rcpp function might be what you are looking for:
Rcpp::cppFunction("NumericMatrix
coords_to_matrix(int ncols, int nrows,
NumericVector x_coords,
NumericVector y_coords) {
if(x_coords.size() != y_coords.size())
stop(\"x_coords and y_coords must be same length\");
NumericMatrix m(nrows, ncols);
for(int i = 0; i < x_coords.size(); i )
{
if((x_coords[i] > ncols - 1) ||
(y_coords[i] > nrows - 1)) continue;
m[y_coords[i] - 1 (x_coords[i] - 1) * m.nrow()] = 1;
}
return m;
}")
This seems to do what you need.
For example:
set.seed(1)
dt <- data.frame(x = sample(10), y = sample(10))
dt
#> x y
#> 1 9 3
#> 2 4 1
#> 3 7 5
#> 4 1 8
#> 5 2 2
#> 6 5 6
#> 7 3 10
#> 8 10 9
#> 9 6 4
#> 10 8 7
mat <- coords_to_matrix(10, 10, dt$x, dt$y)
mat
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 0 0 0 1 0 0 0 0 0 0
#> [2,] 0 1 0 0 0 0 0 0 0 0
#> [3,] 0 0 0 0 0 0 0 0 1 0
#> [4,] 0 0 0 0 0 1 0 0 0 0
#> [5,] 0 0 0 0 0 0 1 0 0 0
#> [6,] 0 0 0 0 1 0 0 0 0 0
#> [7,] 0 0 0 0 0 0 0 1 0 0
#> [8,] 1 0 0 0 0 0 0 0 0 0
#> [9,] 0 0 0 0 0 0 0 0 0 1
#> [10,] 0 0 1 0 0 0 0 0 0 0
And it seems to run considerably faster than any of your current options:
iwidth = 4288
iheight = 8576
dt5 = data.frame( xval=sample(iwidth ,5), yval=sample(iheight ,5) )
dt50 = data.frame( xval=sample(iwidth ,50), yval=sample(iheight ,50) )
dt500 = data.frame( xval=sample(iwidth ,500), yval=sample(iheight ,500) )
dt5000 = data.frame( xval=sample(iwidth ,5000, replace = TRUE),
yval=sample(iheight ,5000, replace = TRUE) )
dt50000 = data.frame( xval=sample(iwidth ,50000, replace = TRUE),
yval=sample(iheight ,50000, replace = TRUE) )
microbenchmark::microbenchmark(
m5 = m5 <- coords_to_matrix(iwidth, iheight, dt5$xval, dt5$yval),
m50 = m50 <- coords_to_matrix(iwidth, iheight, dt50$xval, dt50$yval),
m500 = m500 <- coords_to_matrix(iwidth, iheight, dt500$xval, dt500$yval),
m5000 = m5000 <- coords_to_matrix(iwidth, iheight, dt5000$xval, dt5000$yval),
m50000 = m50000 <- coords_to_matrix(iwidth, iheight, dt50000$xval, dt50000$yval),
times = 10)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> m5 45.5397 55.2420 105.15879 60.25800 83.0363 284.8644 10 a
#> m50 45.3205 53.1242 127.77022 58.02275 294.3918 305.8922 10 a
#> m500 45.3013 45.4073 98.20344 53.51115 55.8047 292.2100 10 a
#> m5000 45.4192 45.7605 76.51107 54.57740 55.7256 278.7359 10 a
#> m50000 46.2567 49.4814 104.44953 56.87705 78.4683 302.9901 10 a