Home > Enterprise >  Setting many values in large matrix in R
Setting many values in large matrix in R

Time:11-03

Problem: I have a large data.table dt with about a million x and y values. These x-y combinations stand for events on a 2-D plane. I know the dimensions of that plane (iwidth, iheight).

I want to create a matrix that is 0 everywhere except in the x-y value that that are listed in the data.table. At these spots the matrix value should be 1. In general this is easily done, but with a million x-y values to set, regular approaches are not suited.

Approach: Since not every x-y combination will be represented in the data.table, I first create a 0-matrix with the right dimensions. Then I replace the 0s at the points indicated by the data.table into 1s.

## initial setup (for easier testing we just use a data.frame, not a data.table)
iwidth = 4288
iheight = 8576
dt = data.frame( xval=sample(iwidth ,10), yval=sample(iheight ,10) )

## simple approach
mx = matrix(ncol=iwidth, nrow=iheight, data=0)
mx[dt$xval, dt$yval] = 1

## biganalytics approach
library(biganalytics)
mx = as.big.matrix(matrix(ncol=iwidth, nrow=iheight, data=0))
mx[dt$xval, dt$yval] = 1

Failure: For small data this works perfectly fine. However, when you actually have a data.table with a million rows, it takes forever. I thought the biganalytics package might help, but this is true only for small data, while it is actually worse with large data (see benchmark below). I also tried apply or with but for me they did not work either (I think they should be even slower).

These are the microbenchmark results (with n=1) for the above approach (the dt5, dt50, etc. stand for a data.table with 5 rows, 50 rows, etc.). The time it takes increases extremely once we reach long data.tables (i.e. many values to replace in the matrix).

## Regular matrix:
Unit: milliseconds
   expr       min        lq      mean    median        uq       max neval
    dt5  130.8255  130.8255  130.8255  130.8255  130.8255  130.8255     1
   dt50   87.2308   87.2308   87.2308   87.2308   87.2308   87.2308     1
  dt500   86.7591   86.7591   86.7591   86.7591   86.7591   86.7591     1
 dt5000  129.6120  129.6120  129.6120  129.6120  129.6120  129.6120     1
dt50000 4340.6080 4340.6080 4340.6080 4340.6080 4340.6080 4340.6080     1

## Biganalytics matrix:
Unit: milliseconds
   expr          min           lq         mean       median           uq          max neval
    dt5     0.988101     0.988101     0.988101     0.988101     0.988101     0.988101     1
   dt50     0.779401     0.779401     0.779401     0.779401     0.779401     0.779401     1
  dt500     9.814602     9.814602     9.814602     9.814602     9.814602     9.814602     1
 dt5000   202.574901   202.574901   202.574901   202.574901   202.574901   202.574901     1
dt50000 19939.191600 19939.191600 19939.191600 19939.191600 19939.191600 19939.191600     1

CodePudding user response:

The following Rcpp function might be what you are looking for:

Rcpp::cppFunction("NumericMatrix 
                  coords_to_matrix(int ncols, int nrows, 
                                   NumericVector x_coords,
                                   NumericVector y_coords) {
                  if(x_coords.size() != y_coords.size()) 
                    stop(\"x_coords and y_coords must be same length\");
                  NumericMatrix m(nrows, ncols);
                  for(int i = 0; i < x_coords.size(); i  )
                  {
                    if((x_coords[i] > ncols - 1) || 
                       (y_coords[i] > nrows - 1)) continue;
                    m[y_coords[i] - 1   (x_coords[i] - 1) * m.nrow()] = 1;
                  }
                  return m; 
                  }")

This seems to do what you need.

For example:

set.seed(1)
dt <- data.frame(x = sample(10), y = sample(10))
dt
#>     x  y
#> 1   9  3
#> 2   4  1
#> 3   7  5
#> 4   1  8
#> 5   2  2
#> 6   5  6
#> 7   3 10
#> 8  10  9
#> 9   6  4
#> 10  8  7

mat <- coords_to_matrix(10, 10, dt$x, dt$y)
mat
#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#>  [1,]    0    0    0    1    0    0    0    0    0     0
#>  [2,]    0    1    0    0    0    0    0    0    0     0
#>  [3,]    0    0    0    0    0    0    0    0    1     0
#>  [4,]    0    0    0    0    0    1    0    0    0     0
#>  [5,]    0    0    0    0    0    0    1    0    0     0
#>  [6,]    0    0    0    0    1    0    0    0    0     0
#>  [7,]    0    0    0    0    0    0    0    1    0     0
#>  [8,]    1    0    0    0    0    0    0    0    0     0
#>  [9,]    0    0    0    0    0    0    0    0    0     1
#> [10,]    0    0    1    0    0    0    0    0    0     0

And it seems to run considerably faster than any of your current options:

iwidth = 4288
iheight = 8576
dt5 = data.frame( xval=sample(iwidth ,5), yval=sample(iheight ,5) )
dt50 = data.frame( xval=sample(iwidth ,50), yval=sample(iheight ,50) )
dt500 = data.frame( xval=sample(iwidth ,500), yval=sample(iheight ,500) )
dt5000 = data.frame( xval=sample(iwidth ,5000, replace = TRUE), 
                     yval=sample(iheight ,5000, replace = TRUE) )
dt50000 =  data.frame( xval=sample(iwidth ,50000, replace = TRUE), 
                       yval=sample(iheight ,50000, replace = TRUE) )

microbenchmark::microbenchmark(
  m5 = m5 <- coords_to_matrix(iwidth, iheight, dt5$xval, dt5$yval),
  m50 = m50 <- coords_to_matrix(iwidth, iheight, dt50$xval, dt50$yval),
  m500 = m500 <- coords_to_matrix(iwidth, iheight, dt500$xval, dt500$yval),
  m5000 = m5000 <- coords_to_matrix(iwidth, iheight, dt5000$xval, dt5000$yval),
  m50000 = m50000 <- coords_to_matrix(iwidth, iheight, dt50000$xval, dt50000$yval),
  times = 10)
#> Unit: milliseconds
#>    expr     min      lq      mean   median       uq      max neval cld
#>      m5 45.5397 55.2420 105.15879 60.25800  83.0363 284.8644    10   a
#>     m50 45.3205 53.1242 127.77022 58.02275 294.3918 305.8922    10   a
#>    m500 45.3013 45.4073  98.20344 53.51115  55.8047 292.2100    10   a
#>   m5000 45.4192 45.7605  76.51107 54.57740  55.7256 278.7359    10   a
#>  m50000 46.2567 49.4814 104.44953 56.87705  78.4683 302.9901    10   a
  • Related