How to create an HDF5 file from NumPy dataset files

I have 1970 .npy files containing features for the MSVD dataset. I want to combine these NumPy files into a single .hdf5 file.

import os 
import numpy as np
import h5py


TRAIN_FEATURE_DIR = "MSVD"   

for filename in os.listdir(TRAIN_FEATURE_DIR):
    f = np.load(os.path.join(TRAIN_FEATURE_DIR, filename))
...

CodePudding user response：

Creating a dataset from an array is straightforward. The example below loops over all .npy files in the current folder and creates one dataset per array. (FYI, I prefer glob.iglob() for getting filenames with a wildcard.) Each dataset is named after its file, including the .npy extension.

import glob 
import numpy as np
import h5py

with h5py.File('SO_74788877.h5','w') as h5f:
    for filename in glob.iglob('*.npy'):
        arr = np.load(filename)
        h5f.create_dataset(filename,data=arr)

The following code shows how to read the dataset names and properties back from the H5 file created above (each dataset is an h5py Dataset object, which behaves like a NumPy array in many situations):

with h5py.File('SO_74788877.h5','r') as h5f:
    for name, dataset in h5f.items():
        print(name, dataset.shape, dataset.dtype)
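
To illustrate the "behaves like a NumPy array" remark, here is a minimal sketch. The file name demo.h5 and the dataset name "features" are hypothetical and used only for this demonstration. Indexing a dataset reads just the requested part from disk, while indexing with [()] loads the whole dataset as a NumPy array.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names, used only for this demonstration.
tmp_dir = tempfile.mkdtemp()
h5_path = os.path.join(tmp_dir, "demo.h5")

arr = np.arange(12, dtype=np.float32).reshape(3, 4)

with h5py.File(h5_path, "w") as h5f:
    h5f.create_dataset("features", data=arr)

with h5py.File(h5_path, "r") as h5f:
    dataset = h5f["features"]   # h5py Dataset object; the data stays on disk
    first_row = dataset[0]      # reads only the first row into a NumPy array
    full = dataset[()]          # reads the whole dataset into a NumPy array
    shape, dtype = dataset.shape, dataset.dtype

print(type(first_row).__name__, first_row.shape)
print(full.shape, dtype)
```

Note that the arrays must be read while the file is still open. Once the with block exits, the Dataset object can no longer be used, but first_row and full remain valid because they are ordinary in-memory arrays.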

CodePudding user response：

The following code solved my problem:

import os
import numpy as np
import h5py

TRAIN_FEATURE_DIR = "MSVD"                        # folder containing the .npy files

# The context manager closes the output file even if an error occurs.
with h5py.File("out.hdf5", "w") as h5:            # output HDF5 file name
    for filename in os.listdir(TRAIN_FEATURE_DIR):
        video_id = os.path.splitext(filename)[0]  # optional, to remove '.npy'
        video_id = video_id.split('.')[0]         # optional, to remove '.avi' from video_id

        features = np.load(os.path.join(TRAIN_FEATURE_DIR, filename))
        h5[video_id] = features                   # one dataset per video, keyed by video_id
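
As a sanity check on this approach, the sketch below writes a few arrays and then compares each stored dataset with its source file. It is only an illustration: the temporary folders and the file names vid1.avi.npy and vid2.avi.npy are hypothetical stand-ins for the real MSVD feature files, which are assumed here to follow a "<video_id>.avi.npy" naming pattern.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-in for the MSVD feature folder: two tiny .npy files.
feature_dir = tempfile.mkdtemp()
out_path = os.path.join(tempfile.mkdtemp(), "out.hdf5")
rng = np.random.default_rng(0)
for name in ("vid1.avi.npy", "vid2.avi.npy"):
    np.save(os.path.join(feature_dir, name), rng.random((5, 8)))

# Write one dataset per file, keyed by the video id (same idea as the answer above).
with h5py.File(out_path, "w") as h5:
    for filename in sorted(os.listdir(feature_dir)):
        video_id = os.path.splitext(filename)[0].split(".")[0]
        h5[video_id] = np.load(os.path.join(feature_dir, filename))

# Read everything back and compare each dataset with its source file.
mismatches = []
with h5py.File(out_path, "r") as h5:
    keys = sorted(h5.keys())
    for filename in os.listdir(feature_dir):
        video_id = filename.split(".")[0]
        source = np.load(os.path.join(feature_dir, filename))
        if not np.array_equal(h5[video_id][()], source):
            mismatches.append(video_id)

print(keys, "mismatches:", mismatches)
```

If two files reduce to the same video_id after the extensions are stripped, the second assignment fails because the dataset name already exists, so it may be worth checking for duplicates before running this on the full folder.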