Save a list of dictionaries with numpy arrays

Time:08-29

I have a dataset structured like this:

dataset = [{"sample":[numpy array (2048,3) shape], "category":"Cat"}, ....]

Each element of the list is a dictionary with two keys: "sample", whose value is a numpy array of shape (2048, 3), and "category", which is the class of that sample. The dataset contains 8000 elements.

I tried saving it as JSON, but got an error saying that numpy arrays can't be serialized.

What's the best way to save this list? I can't use np.save("file", dataset) because the elements are dictionaries, and I can't use JSON because of the numpy arrays. Should I use HDF5? Which format should I use if I need the dataset for machine learning? Thanks!

Answer:

Creating an example specific to your data requires more details about the dictionaries in the list. I created an example that assumes every dictionary has:

  • A category key with a unique value. The value is used as the dataset name, and names must be unique within an HDF5 group.
  • A sample key holding the array you want to save.

The code below creates some data, writes it to an HDF5 file with the h5py package, then reads the data back into a new list of dictionaries. It should be a good starting point for your problem.

import numpy as np
import h5py

a0, a1 = 10, 5
arr1 = np.arange(a0*a1).reshape(a0,a1)
arr2 = np.arange(a0*a1,2*a0*a1).reshape(a0,a1)
arr3 = np.arange(2*a0*a1,3*a0*a1).reshape(a0,a1)

dataset = [{"sample":arr1, "category":"Cat"}, 
           {"sample":arr2, "category":"Dog"},
           {"sample":arr3, "category":"Fish"},
           ]

# Create the HDF5 file with "category" as dataset name and "sample" as the data
with h5py.File('SO_73499414.h5', 'w') as h5f:
    for ds_dict in dataset:
        h5f.create_dataset(ds_dict["category"], data=ds_dict["sample"])

# Retrieve the HDF5 data with "category" as dataset name and "sample" as the data
ds_list = []
with h5py.File('SO_73499414.h5', 'r') as h5f:
    for ds_name in h5f:
        print(ds_name,'\n',h5f[ds_name]) # prints the name and the dataset object (shape and dtype)
        print(h5f[ds_name][()]) # prints the dataset values (as an array) 
        # add data and name to list
        ds_list.append({"sample":h5f[ds_name][()], "category":ds_name})
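
The unique-category assumption matters because writing a second dataset under an existing name fails. The following is a minimal sketch of that failure, not part of the original answer. It assumes h5py raises ValueError for a duplicate name (RuntimeError is also caught in case an older version behaves differently), and the file name duplicate_demo.h5 is a placeholder.

```python
import h5py
import numpy as np

# Two samples that share the same category, as is likely with 8000 samples.
dataset = [
    {"sample": np.zeros((10, 5)), "category": "Cat"},
    {"sample": np.ones((10, 5)), "category": "Cat"},  # duplicate category
]

error_message = None
# driver="core" with backing_store=False keeps the file in memory only.
with h5py.File("duplicate_demo.h5", "w", driver="core", backing_store=False) as h5f:
    try:
        for ds_dict in dataset:
            # The second iteration reuses the name "Cat" and is expected to fail.
            h5f.create_dataset(ds_dict["category"], data=ds_dict["sample"])
    except (ValueError, RuntimeError) as err:
        error_message = str(err)

print(error_message)
```

With a real dataset of many samples per class, every category after the first occurrence would hit this error, so the dataset name has to come from something other than the category.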

Here is a second method for when category values aren't unique. Each dataset is named with a counter, and the category is saved as a dataset attribute.

a0, a1 = 10, 5
arr1 = np.arange(a0*a1).reshape(a0,a1)
arr2 = np.arange(a0*a1,2*a0*a1).reshape(a0,a1)
arr3 = np.arange(2*a0*a1,3*a0*a1).reshape(a0,a1)
arr4 = np.arange(3*a0*a1,4*a0*a1).reshape(a0,a1)

dataset = [{"sample":arr1, "category":"Cat"}, 
           {"sample":arr2, "category":"Dog"},
           {"sample":arr3, "category":"Cat"},
           {"sample":arr4, "category":"Dog"}
           ]

# Create the HDF5 file with a counter-based dataset name and "sample" as the data
# "category" is saved as a dataset attribute
with h5py.File('SO_73499414.h5', 'w') as h5f:
    for i, ds_dict in enumerate(dataset):
        ds = h5f.create_dataset(f'ds_{i:04}', data=ds_dict["sample"])
        ds.attrs["category"] = ds_dict["category"]

# Retrieve the HDF5 data with "sample" as the data and "category" from the attribute
ds_list = []
with h5py.File('SO_73499414.h5', 'r') as h5f:
    for ds_name in h5f:
        print(ds_name,'\n',h5f[ds_name]) # prints the name and the dataset object (shape and dtype)
        print(h5f[ds_name].attrs["category"]) # prints the category attribute
        print(h5f[ds_name][()]) # prints the dataset values (as an array) 
        
        # add data and name to list
        ds_list.append({"sample":h5f[ds_name][()], "category":h5f[ds_name].attrs["category"]})
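
Because every sample in the question has the same shape, another layout that may suit machine learning is to stack all samples into a single array and keep the categories in a parallel array. The sketch below is an alternative to the HDF5 code above, not part of the original answer. It uses NumPy's own .npz format via np.savez_compressed, with a small stand-in dataset and a temporary file path chosen purely for illustration.

```python
import os
import tempfile

import numpy as np

# Small stand-in data; the real dataset would hold 8000 samples of shape (2048, 3).
rng = np.random.default_rng(0)
dataset = [
    {"sample": rng.random((2048, 3)), "category": "Cat"},
    {"sample": rng.random((2048, 3)), "category": "Dog"},
    {"sample": rng.random((2048, 3)), "category": "Cat"},
]

# Stack the per-sample arrays into one (N, 2048, 3) array and
# collect the labels into a parallel array of strings.
samples = np.stack([d["sample"] for d in dataset])
categories = np.array([d["category"] for d in dataset])

# Hypothetical output location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "dataset.npz")
np.savez_compressed(path, samples=samples, categories=categories)

# Load both arrays back and rebuild the list of dictionaries.
with np.load(path) as npz:
    loaded_samples = npz["samples"]
    loaded_categories = npz["categories"]

restored = [
    {"sample": s, "category": str(c)}
    for s, c in zip(loaded_samples, loaded_categories)
]
```

The stacked samples array can be fed to most training code directly, and row i of samples always corresponds to entry i of categories, so the original list order is preserved.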