I have a dataset composed as:
dataset = [{"sample":[numpy array (2048,3) shape], "category":"Cat"}, ....]
Each element of the list is a dictionary. The "sample" key holds a numpy array of shape (2048, 3), and the "category" key holds the class of that sample. The dataset has 8000 elements.
I tried to save it as JSON, but it reported that numpy arrays can't be serialized.
What's the best way to save this list? I can't use np.save("file", dataset) because the elements are dictionaries, and I can't use JSON because of the numpy arrays. Should I use HDF5? What format should I use if the dataset is meant for machine learning?
Thanks!
CodePudding user response:
Creating an example specific to your data requires more details about the dictionaries in the list. I created an example that assumes every dictionary has:
- A unique value for the category key. The value is used for the dataset name.
- A sample key with the array you want to save.
The code below creates some data, writes it to an HDF5 file with the h5py package, then reads the data back into a new list of dictionaries. It is a good starting point for your problem.
import numpy as np
import h5py

a0, a1 = 10, 5
arr1 = np.arange(a0*a1).reshape(a0,a1)
arr2 = np.arange(a0*a1,2*a0*a1).reshape(a0,a1)
arr3 = np.arange(2*a0*a1,3*a0*a1).reshape(a0,a1)
dataset = [{"sample":arr1, "category":"Cat"},
           {"sample":arr2, "category":"Dog"},
           {"sample":arr3, "category":"Fish"},
          ]

# Create the HDF5 file with "category" as dataset name and "sample" as the data
with h5py.File('SO_73499414.h5', 'w') as h5f:
    for ds_dict in dataset:
        h5f.create_dataset(ds_dict["category"], data=ds_dict["sample"])

# Retrieve the HDF5 data with "category" as dataset name and "sample" as the data
ds_list = []
with h5py.File('SO_73499414.h5', 'r') as h5f:
    for ds_name in h5f:
        print(ds_name,'\n',h5f[ds_name]) # prints name and dataset attributes
        print(h5f[ds_name][()]) # prints the dataset values (as an array)
        # add data and name to list
        ds_list.append({"sample":h5f[ds_name][()], "category":ds_name})
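To check that the round trip preserved the data, the original and restored lists can be compared. The helper below is only a sketch: the name same_records is hypothetical, and it assumes category values are unique, as in the example above. It matches records by category rather than by position, because the read-back order depends on how the file iterates its dataset names and may not match the original list order.

```python
import numpy as np


def same_records(original, restored):
    """Return True if both lists hold the same category -> sample pairs.

    Hypothetical helper; assumes every "category" value is unique.
    """
    orig_by_cat = {d["category"]: d["sample"] for d in original}
    rest_by_cat = {d["category"]: d["sample"] for d in restored}
    # The two lists must contain exactly the same categories.
    if orig_by_cat.keys() != rest_by_cat.keys():
        return False
    # Each restored array must match the original in shape and values.
    return all(np.array_equal(orig_by_cat[k], rest_by_cat[k])
               for k in orig_by_cat)
```

With the variables from the example, same_records(dataset, ds_list) should return True after the read loop finishes.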
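Here is a second method for when category values aren't unique. Each dataset is named with a counter, and the category is stored as an attribute on the dataset.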
import numpy as np
import h5py

a0, a1 = 10, 5
arr1 = np.arange(a0*a1).reshape(a0,a1)
arr2 = np.arange(a0*a1,2*a0*a1).reshape(a0,a1)
arr3 = np.arange(2*a0*a1,3*a0*a1).reshape(a0,a1)
arr4 = np.arange(3*a0*a1,4*a0*a1).reshape(a0,a1)
dataset = [{"sample":arr1, "category":"Cat"},
           {"sample":arr2, "category":"Dog"},
           {"sample":arr3, "category":"Cat"},
           {"sample":arr4, "category":"Dog"}
          ]

# Create the HDF5 file with dataset name using counter and "sample" as the data
# "category" is saved as a dataset attribute
with h5py.File('SO_73499414.h5', 'w') as h5f:
    for i, ds_dict in enumerate(dataset):
        ds = h5f.create_dataset(f'ds_{i:04}', data=ds_dict["sample"])
        ds.attrs["category"] = ds_dict["category"]

# Retrieve the HDF5 data with "sample" as the data and "category" from the attribute
ds_list = []
with h5py.File('SO_73499414.h5', 'r') as h5f:
    for ds_name in h5f:
        print(ds_name,'\n',h5f[ds_name]) # prints name and dataset attributes
        print(h5f[ds_name].attrs["category"]) # prints the category attribute
        print(h5f[ds_name][()]) # prints the dataset values (as an array)
        # add data and category to list
        ds_list.append({"sample":h5f[ds_name][()], "category":h5f[ds_name].attrs["category"]})
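Since every sample in the question has the same shape, another layout that may suit machine learning is to stack all samples into one array and store the categories as integer labels. This is a different technique from the two methods above, shown only as a sketch. The function names save_stacked and load_stacked and the dataset names "samples", "labels" and "class_names" are assumptions for illustration, and reading the strings back with asstr() assumes h5py 3.0 or later.

```python
import numpy as np
import h5py


def save_stacked(path, dataset):
    """Write a list of {"sample", "category"} dicts as three HDF5 datasets."""
    # Map each category string to an integer class id (sorted for determinism).
    class_names = sorted({d["category"] for d in dataset})
    class_id = {name: i for i, name in enumerate(class_names)}

    # Stack samples into one (N, rows, cols) array; all samples must share a shape.
    samples = np.stack([d["sample"] for d in dataset])
    labels = np.array([class_id[d["category"]] for d in dataset], dtype=np.int64)

    with h5py.File(path, "w") as h5f:
        h5f.create_dataset("samples", data=samples, compression="gzip")
        h5f.create_dataset("labels", data=labels)
        # Variable-length strings keep the id -> name mapping inside the file.
        h5f.create_dataset("class_names",
                           data=np.array(class_names, dtype=object),
                           dtype=h5py.string_dtype())


def load_stacked(path):
    """Read back (samples, labels, class_names) written by save_stacked."""
    with h5py.File(path, "r") as h5f:
        samples = h5f["samples"][()]
        labels = h5f["labels"][()]
        # asstr() decodes the stored strings to Python str (h5py >= 3.0).
        class_names = list(h5f["class_names"].asstr()[()])
    return samples, labels, class_names
```

With this layout, samples[i] and labels[i] describe the same record, and class_names[labels[i]] recovers the original category string. For the data in the question, the "samples" dataset would have shape (8000, 2048, 3).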