I have a large .tiff file (4.4gB, 79530 x 54980 values) with 1 band. Since only 16% of the values are valid, I was thinking it's better to import the file as sparse matrix, to save RAM. When I first open it as np.array
and then transform it into a sparse matrix using csr_matrix()
, my kernel already crashes. See code below.
from osgeo import gdal
import numpy as np
from scipy.sparse import csr_matrix
ds = gdal.Open("file.tif")
band = ds.GetRasterBand(1)
array = np.array(band.ReadAsArray())
csr_matrix(array)
Is there a better way to work with this file? In the end I have to make calculations based on the values in the raster. (Unfortunately, due to confidentiality, I cannot attach the relevant file.)
CodePudding user response:
Can you tell where the crash occurs?
band = ds.GetRasterBand(1)
temp = band.ReadAsArray()
array = np.array(temp) # if temp is already an array, you don't need this
csr_matrix(array)
If array
is 4.4gB, (79530, 54980)
In [62]: (79530 * 54980) / 1e9
Out[62]: 4.3725594 # 4.4gB makes sense for 1 byte/element
In [63]: (79530 * 54980) * 0.16 # 16% density
Out[63]: 699609504.0 # number of nonzero values
creating csr
requires doing np.nonzero(array)
to get the indices. That will produce 2 arrays of this 0.7 * 8 Gb size (indices are 8 byte ints). coo
format actually requires those 2 arrays plus 0.7 for the nonzero values - about 12 Gb . Converted to csr
, the row
attribute is reduced to 79530 elements - so about 7 Gb . (corrected for 8 bytes/element)
So at 16% density, the sparse format is, at it's best, is still larger than the dense version.
Memory error when converting matrix to sparse matrix, specified dtype is invalid
is a recent case of a memory error - which occurred in nonzero
step.
CodePudding user response:
Assuming you know size of your matrix, you can create an empty sparse matrix, and then set only valid values one-by-one.
from osgeo import gdal
import numpy as np
from scipy.sparse import csr_matrix
ds = gdal.Open("file.tif")
band = ds.GetRasterBand(1)
matrix_size = (1000, 1000) # set you size
matrix = csr_matrix(matrix_size)
# for each valid value
matrix[i, j] = your_value
Edit 1
If you don't know size of your matrix, you should be able to check it like this:
from osgeo import gdal
ds = gdal.Open("file.tif")
width = ds.GetRasterXSize()
height = ds.GetRasterYSize()
matrix_size = (width, height)
Edit 2
I measured metrices suggested in comments (filled to the full). This is how I measured memory usage.
size 500x500
matrix | empty size | full size | filling time |
---|---|---|---|
csr_matrix | 2856 | 2992 | 477.67 s |
doc_matrix | 726 | 35807578 | 3.15 s |
lil_matrix | 8840 | 8840 | 0.54 s |
size 1000x1000
matrix | empty size | full size | filling time |
---|---|---|---|
csr_matrix | 4856 | 4992 | 7164.94 s |
doc_matrix | 726 | 150578858 | 12.81 s |
lil_matrix | 16840 | 16840 | 2.19 s |
Probably the best solution would be to use lil_matrix