EDIT 3: TL;DR My issue was due to my matrix not being sparse enough, and to calculating the size of a sparse array incorrectly.
I was hoping someone could explain why this is happening. I am using Colab with 51 GB of RAM and need to load float32 data from an H5 file. I can load a test H5 file as a single NumPy array, with RAM landing around 45 GB. I load it in batches (21 total) and stack them. But when I load each batch as NumPy, convert it to sparse, and hstack the pieces, memory explodes and I get an OOM after batch 12 or so.
This code simulates it, and you can change the data size to test it on your computer. I get completely unexplainable memory increases: even though the variables look small when I check their sizes, the process keeps growing. What is happening? What am I doing wrong?
import os, psutil
import gc
gc.enable()
from scipy import sparse
import numpy as np
all_x = None
x = (1*(np.random.rand(97406, 2048)>0.39721115241072164)).astype('float32')
x2 = sparse.csr_matrix(x)
print('GB on Memory SPARSE ', x2.data.nbytes/ 10**9)
print('GB on Memory NUMPY ', x.nbytes/ 10**9)
print('sparse to dense mat ratio', x2.data.nbytes/ x.nbytes)
print('_____________________')
for k in range(8):
    if all_x is None:
        all_x = x2
    else:
        all_x = sparse.hstack([all_x, x2])
    print('GB on Memory ALL SPARSE ', all_x.data.nbytes/ 10**9)
    print('GB USED BEFORE GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9)
    gc.collect()
    print('GB USED AFTER GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9)
    print('_____________________')
GB on Memory SPARSE 0.481035332
GB on Memory NUMPY 0.797949952
sparse to dense mat ratio 0.6028389760464576
_____________________
GB on Memory ALL SPARSE 0.481035332
GB USED BEFORE GC 4.62065664
GB USED AFTER GC 4.6206976
_____________________
GB on Memory ALL SPARSE 0.962070664
GB USED BEFORE GC 8.473133056
GB USED AFTER GC 8.473133056
_____________________
GB on Memory ALL SPARSE 1.443105996
GB USED BEFORE GC 12.325183488
GB USED AFTER GC 12.325183488
_____________________
GB on Memory ALL SPARSE 1.924141328
GB USED BEFORE GC 17.140740096
GB USED AFTER GC 17.140740096
_____________________
GB on Memory ALL SPARSE 2.40517666
GB USED BEFORE GC 20.512710656
GB USED AFTER GC 20.512710656
_____________________
GB on Memory ALL SPARSE 2.886211992
GB USED BEFORE GC 22.920142848
GB USED AFTER GC 22.920142848
_____________________
GB on Memory ALL SPARSE 3.367247324
GB USED BEFORE GC 29.660889088
GB USED AFTER GC 29.660889088
_____________________
GB on Memory ALL SPARSE 3.848282656
GB USED BEFORE GC 33.99727104
GB USED AFTER GC 33.99727104
_____________________
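Part of the problem is structural: sparse.hstack builds a brand-new matrix on every call (through an intermediate COO build, as EDIT 2 below shows), so each iteration re-copies everything accumulated so far and transiently holds both the old and the new copy, which also fragments the heap. A minimal sketch of the list-then-stack-once pattern (same idea as the NumPy version in the EDIT below; blocks is just an illustrative name):
# Sketch: accumulate references in a plain list (no copying),
# then pay the concatenation cost a single time at the end.
blocks = []
for k in range(8):
    blocks.append(x2)             # in real code: the k-th batch from the H5 file
all_x = sparse.hstack(blocks)     # one COO build instead of 8 incremental ones
all_x = all_x.tocsr()             # back to CSR if needed downstream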
EDIT: If I stack a list with NumPy's hstack instead, it works fine:
import os, psutil
import gc
gc.enable()
from scipy import sparse
import numpy as np
all_x = None
x = (1*(np.random.rand(97406, 2048)>0.39721115241072164)).astype('float32')
x2 = sparse.csr_matrix(x)
print('GB on Memory SPARSE ', x2.data.nbytes/ 10**9)
print('GB on Memory NUMPY ', x.nbytes/ 10**9)
print('sparse to dense mat ratio', x2.data.nbytes/ x.nbytes)
print('_____________________')
all_x = np.hstack([x]*21)
print('GB on Memory ALL NUMPY ', all_x.nbytes/ 10**9)
print('GB USED BEFORE GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9)
gc.collect()
print('GB USED AFTER GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9)
print('_____________________')
Output:
GB on Memory SPARSE 0.480956104
GB on Memory NUMPY 0.797949952
sparse to dense mat ratio 0.6027396866113227
_____________________
GB on Memory ALL NUMPY 16.756948992
GB USED BEFORE GC 38.169387008
GB USED AFTER GC 38.169411584
_____________________
But when I do the same with the sparse matrix, I get an OOM, even though, going by the byte counts, the sparse matrix should be smaller:
import os, psutil
import gc
gc.enable()
from scipy import sparse
import numpy as np
all_x = None
x = (1*(np.random.rand(97406, 2048)>0.39721115241072164)).astype('float32')
x2 = sparse.csr_matrix(x)
print('GB on Memory SPARSE ', x2.data.nbytes/ 10**9)
print('GB on Memory NUMPY ', x.nbytes/ 10**9)
print('sparse to dense mat ratio', x2.data.nbytes/ x.nbytes)
print('_____________________')
all_x = sparse.hstack([x2]*21)
print('GB on Memory ALL SPARSE ', all_x.data.nbytes/ 10**9)
print('GB USED BEFORE GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9)
gc.collect()
print('GB USED AFTER GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9)
print('_____________________')
Running the above returns an OOM error.
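Some back-of-the-envelope arithmetic suggests why. sparse.hstack assembles its result in COO format (see EDIT 2 below), which stores one float32 value plus a row and a column coordinate per nonzero. Using the nnz from the CSR repr in EDIT 2, and assuming 32-bit indices:
nnz_per_block = 120238348                          # stored elements in one CSR block
blocks = 21
coo_bytes = blocks * nnz_per_block * (4 + 4 + 4)   # float32 data + int32 row + int32 col
print(coo_bytes/ 10**9)                            # ~30.3 GB for the COO result alone
On top of that, the total nnz (~2.5e9) exceeds 2**31, so SciPy needs 64-bit index arrays at least for the CSR conversion, which together with the 21 inputs pushes the footprint past 51 GB.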
EDIT 2: It seems I was calculating the true size of the sparse matrix incorrectly. It can be calculated with:
def bytes_in_sparse(a):
    return a.data.nbytes + a.indptr.nbytes + a.indices.nbytes
The true comparison between the dense and sparse arrays is:
GB on Memory SPARSE 0.962395268
GB on Memory NUMPY 0.797949952
sparse to dense mat ratio 1.2060847495357703
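That ratio is what the density predicts: each stored float32 costs 4 bytes of data plus 4 bytes of int32 column index, so at ~60% density CSR needs about 0.603 × 8 ≈ 4.8 bytes per dense element versus 4 bytes dense, with indptr adding a negligible extra. A quick check:
density = 120238348 / (97406 * 2048)   # nnz / total elements ≈ 0.6027
print(density * (4 + 4) / 4)           # ≈ 1.205, matching the measured 1.206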
Once I use sparse.hstack, the two variables become different types of sparse matrices.
all_x, x2
outputs
(<97406x4096 sparse matrix of type '<class 'numpy.float32'>'
with 240476696 stored elements in COOrdinate format>,
<97406x2048 sparse matrix of type '<class 'numpy.float32'>'
with 120238348 stored elements in Compressed Sparse Row format>)
Answer:
With smaller dimensions, so I don't hang my computer:
In [50]: x = (1 * (np.random.rand(974, 204) > 0.39721115241072164)).astype("float32")
In [51]: x.nbytes
Out[51]: 794784
The CSR and its approximate memory use:
In [52]: M = sparse.csr_matrix(x)
In [53]: M.data.nbytes + M.indices.nbytes + M.indptr.nbytes
Out[53]: 960308
hstack actually uses the coo format:
In [54]: Mo = M.tocoo()
In [55]: Mo.data.nbytes + Mo.row.nbytes + Mo.col.nbytes
Out[55]: 1434612
Combining 10 copies, nbytes increases by a factor of 10:
In [56]: xx = np.hstack([x]*10)
In [57]: xx.shape
Out[57]: (974, 2040)
Same with sparse:
In [58]: MM = sparse.hstack([M] * 10)
In [59]: MM.shape
Out[59]: (974, 2040)
In [60]: xx.nbytes
Out[60]: 7947840
In [61]: MM
Out[61]:
<974x2040 sparse matrix of type '<class 'numpy.float32'>'
with 1195510 stored elements in Compressed Sparse Row format>
In [62]: M
Out[62]:
<974x204 sparse matrix of type '<class 'numpy.float32'>'
with 119551 stored elements in Compressed Sparse Row format>
In [63]: MM.data.nbytes + MM.indices.nbytes + MM.indptr.nbytes
Out[63]: 9567980
A sparse density of
In [65]: M.nnz / np.prod(M.shape)
Out[65]: 0.6016779401699078
does not save memory. 0.1 or smaller is a good working density if you want to both save memory and computation time (especially matrix multiplication).
In [66]: (x@x.T).shape
Out[66]: (974, 974)
In [67]: timeit (x@x.T).shape
10.1 ms ± 31.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [68]: (M@M.T).shape
Out[68]: (974, 974)
In [69]: timeit (M@M.T).shape
220 ms ± 91.8 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
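To make the memory break-even concrete: with float32 data and int32 indices, CSR pays roughly 8 bytes per stored element against 4 bytes per element dense, so parity sits near 50% density. A rough sketch (the array size and density grid are just illustrative):
import numpy as np
from scipy import sparse

def csr_bytes(m):
    # values, plus a column index per nonzero, plus one row pointer per row
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

for density in (0.6, 0.5, 0.1):
    x = (np.random.rand(974, 204) < density).astype('float32')
    m = sparse.csr_matrix(x)
    print(f'density {density:.1f}: sparse/dense bytes = {csr_bytes(m)/ x.nbytes:.2f}')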