Saving large sparse arrays in hdf5 using pickle

Time:01-18

In my code I am generating a list of large sparse arrays in CSR format, and I want to store them in a file. I was initially saving them this way:

from scipy.sparse import csr_matrix
import h5py

As = [ csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]]),
       csr_matrix([[1, 0, 0], [0, 1, 0], [0, 0, 1]]),
       csr_matrix([[2, 0, 0], [0, 3, 0], [0, 0, 4]]) ]

np_matrices = [mat.toarray() for mat in As]

filename = "matrices.hdf5"  # placeholder path

with h5py.File(filename, "w") as f:
    f.create_dataset("matrices", data=np_matrices)

This way, however, I run into an out-of-memory error, because I am trying to allocate all the memory at once, so it is not possible to save more than 1000 of these sparse arrays. I already looked at scipy.sparse.save_npz(), but it saves only one csr_matrix per file, which I don't want because I have to generate and store more than 100K matrices. I therefore tried pickle to serialize the sparse matrix objects:

import pickle

pickled_obj = pickle.dumps(As)

with h5py.File('obj.hdf5', 'w') as f:
    dset = f.create_dataset('obj', data=pickled_obj) 

But this leads to the following error: VLEN strings do not support embedded NULLs

Is there a way to deal with this error? Or does anyone have a better way to save a list of csr_matrix objects with good memory performance?

CodePudding user response:

You could consider using numpy for this. I simply create a numpy array, fill it with the csr arrays, and then call numpy.save() with allow_pickle=True. I am currently using it with up to 50k arrays in one .npy file and have had no problems with read or write speed, or any other errors.
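A minimal sketch of the "fill it with the csr arrays" step, assuming a 1-D object array is what is meant. The variable names and the temporary file path are illustrative, not taken from the answer. Filling a pre-allocated object array element by element avoids relying on how NumPy interprets a list of sparse matrices.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix

matrices = [csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]]),
            csr_matrix([[1, 0, 0], [0, 1, 0], [0, 0, 1]])]

# Pre-allocate a 1-D object array and assign one sparse matrix per slot.
arr = np.empty(len(matrices), dtype=object)
for i, mat in enumerate(matrices):
    arr[i] = mat

# Illustrative location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "matrices.npy")
np.save(path, arr, allow_pickle=True)

# Object arrays are stored as pickles, so loading needs allow_pickle=True too.
loaded = np.load(path, allow_pickle=True)
```

Each element of `loaded` is again a csr_matrix, so nothing is densified on the way in or out.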

CodePudding user response:

Here is a POC for saving and loading multiple sparse matrices:

from scipy.sparse import csr_matrix
import numpy

As = [ csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]]),
       csr_matrix([[1, 0, 0], [0, 1, 0], [0, 0, 1]]),
       csr_matrix([[2, 0, 0], [0, 3, 0], [0, 0, 4]]) ]

np_matrices = numpy.array(As)

numpy.save("obj.npy", np_matrices, allow_pickle=True)
np_matrices_loaded = numpy.load("obj.npy", allow_pickle=True)

print(np_matrices)
print(np_matrices_loaded)

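As @Ben0981's answer says, numpy can store the sparse matrices this way. Note that numpy.save() with allow_pickle=True pickles the object array rather than compressing it; the file stays compact because each CSR matrix holds only its non-zero values and index arrays.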

Output:

[<3x3 sparse matrix of type '<class 'numpy.int64'>'
        with 5 stored elements in Compressed Sparse Row format>
 <3x3 sparse matrix of type '<class 'numpy.int64'>'
        with 3 stored elements in Compressed Sparse Row format>
 <3x3 sparse matrix of type '<class 'numpy.int64'>'
        with 3 stored elements in Compressed Sparse Row format>]
[<3x3 sparse matrix of type '<class 'numpy.int64'>'
        with 5 stored elements in Compressed Sparse Row format>
 <3x3 sparse matrix of type '<class 'numpy.int64'>'
        with 3 stored elements in Compressed Sparse Row format>
 <3x3 sparse matrix of type '<class 'numpy.int64'>'
        with 3 stored elements in Compressed Sparse Row format>]
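If the data has to stay in HDF5, one possible way around the "VLEN strings do not support embedded NULLs" error is to store the pickled bytes as a uint8 array instead of a string. The sketch below is an assumption-laden illustration rather than a tested recipe for 100K matrices: the generator `matrix_stream`, the dataset names `mat_0`, `mat_1`, ... and the temporary path are all hypothetical. It pickles one matrix at a time, so only a single matrix is held in memory while writing.

```python
import os
import pickle
import tempfile

import h5py
import numpy as np
from scipy.sparse import csr_matrix


def matrix_stream():
    """Hypothetical stand-in for code that produces one matrix at a time."""
    yield csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
    yield csr_matrix([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
    yield csr_matrix([[2, 0, 0], [0, 3, 0], [0, 0, 4]])


path = os.path.join(tempfile.mkdtemp(), "obj.hdf5")  # illustrative path

with h5py.File(path, "w") as f:
    for i, mat in enumerate(matrix_stream()):
        raw = pickle.dumps(mat)
        # A uint8 array is numeric data, so embedded NUL bytes are not a problem.
        f.create_dataset(f"mat_{i}", data=np.frombuffer(raw, dtype=np.uint8))

with h5py.File(path, "r") as f:
    # Read each dataset back into bytes and unpickle it.
    restored = [pickle.loads(f[f"mat_{i}"][()].tobytes())
                for i in range(len(f))]
```

Reading can also be done one dataset at a time instead of building the full `restored` list, which keeps memory use low on the loading side as well.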