In my code I generate a list of large sparse arrays in CSR format, and I want to store them in a file. Initially I saved them like this:
from scipy.sparse import csr_matrix
import h5py
filename = "matrices.hdf5"  # example output path
As = [csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]]),
      csr_matrix([[1, 0, 0], [0, 1, 0], [0, 0, 1]]),
      csr_matrix([[2, 0, 0], [0, 3, 0], [0, 0, 4]])]
# toarray() converts every sparse matrix to a full dense array in memory
np_matrices = [mat.toarray() for mat in As]
with h5py.File(filename, "w") as f:
    f.create_dataset("matrices", data=np_matrices)
However, this runs out of memory because all the dense arrays are allocated at once, so it is not possible to save more than 1000 of these sparse arrays. I already looked at scipy.sparse.save_npz(), but it saves only one csr_matrix per file, which I don't want because I have to generate and store more than 100K matrices. I therefore tried pickle to serialize the sparse matrix objects:
import pickle
pickled_obj = pickle.dumps(As)
with h5py.File('obj.hdf5', 'w') as f:
    dset = f.create_dataset('obj', data=pickled_obj)
But this leads to the following error: VLEN strings do not support embedded NULLs
Is there a way to deal with this error? Or does anyone have a better way to save a list of csr_matrix objects with good memory performance?
CodePudding user response:
You could consider using numpy for this. I simply create a numpy array, fill it with the csr arrays, and then call numpy.save() with allow_pickle=True. I currently use it with up to 50k arrays in one .npy file and have no problems with read or write speed, or any other errors.
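As for the h5py error in the question: the pickled bytes contain NULL bytes, and h5py tries to store a Python bytes object as a variable-length string. The h5py documentation suggests wrapping opaque binary data in numpy.void instead. The following is only a minimal sketch of that idea, assuming the whole pickled list fits in memory; the temporary file path is made up for the example, and this does not reduce memory use by itself.

```python
import os
import pickle
import tempfile

import h5py
import numpy as np
from scipy.sparse import csr_matrix

As = [csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]]),
      csr_matrix([[1, 0, 0], [0, 1, 0], [0, 0, 1]]),
      csr_matrix([[2, 0, 0], [0, 3, 0], [0, 0, 4]])]

pickled_obj = pickle.dumps(As)

# Example location only; any writable path works
path = os.path.join(tempfile.mkdtemp(), "obj.hdf5")

with h5py.File(path, "w") as f:
    # np.void marks the bytes as opaque binary data, so h5py does not
    # try to store them as a variable-length string
    f.create_dataset("obj", data=np.void(pickled_obj))

with h5py.File(path, "r") as f:
    # Reading the scalar dataset gives a np.void; tobytes() recovers the pickle
    restored = pickle.loads(f["obj"][()].tobytes())

print(len(restored))
```

The matrices stay sparse on disk and after loading, but pickle.dumps() still builds one large bytes object for the whole list.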
CodePudding user response:
Here is a proof of concept for saving and loading multiple sparse matrices:
from scipy.sparse import csr_matrix
import numpy
As = [ csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]]),
csr_matrix([[1, 0, 0], [0, 1, 0], [0, 0, 1]]),
csr_matrix([[2, 0, 0], [0, 3, 0], [0, 0, 4]]) ]
np_matrices = numpy.array(As)
numpy.save("obj.npy", np_matrices, allow_pickle=True)
np_matrices_loaded = numpy.load("obj.npy", allow_pickle=True)
print(np_matrices)
print(np_matrices_loaded)
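As suggested in @Ben0981's answer, numpy.save() with allow_pickle=True pickles each csr_matrix object, so the matrices stay in sparse form on disk and only their stored entries are written. Note that numpy.save() itself does not compress the data.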
Output:
[<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 5 stored elements in Compressed Sparse Row format>
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>]
[<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 5 stored elements in Compressed Sparse Row format>
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>]
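If even the full list of sparse matrices does not fit in memory, the same pickling idea can be applied incrementally. This is a rough sketch using only the standard pickle module: each matrix is appended to one file as its own record and read back one at a time. The helper names append_matrices and iter_matrices and the temporary file path are made up for this example.

```python
import os
import pickle
import tempfile

from scipy.sparse import csr_matrix


def append_matrices(path, matrices):
    """Append each matrix to the file at `path` as a separate pickle record."""
    with open(path, "ab") as f:
        for mat in matrices:
            pickle.dump(mat, f, protocol=pickle.HIGHEST_PROTOCOL)


def iter_matrices(path):
    """Yield the stored matrices one at a time, in the order they were written."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                # No more records in the file
                return


As = [csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]]),
      csr_matrix([[1, 0, 0], [0, 1, 0], [0, 0, 1]]),
      csr_matrix([[2, 0, 0], [0, 3, 0], [0, 0, 4]])]

# Example location only; any writable path works
path = os.path.join(tempfile.mkdtemp(), "matrices.pkl")

# Matrices can be written in batches as they are generated
append_matrices(path, As[:2])
append_matrices(path, As[2:])

loaded = list(iter_matrices(path))
print(len(loaded))
```

Because iter_matrices is a generator, a consumer can process the matrices one by one without holding all of them in memory. As with any pickle-based format, the file should only be loaded from a trusted source.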