How to write multiple large arrays to an h5 file in layers?

Time:03-03

Suppose I have 10000 systems. For each system I have 2 datasets, and each dataset has x, y and y_err arrays. How can I put the data for all the systems into an h5 file, using either h5py or pandas? A detailed description is given below.

import numpy as np

Systems=np.arange(10000)

for sys in Systems:
    x1,y1,y1_err=np.random.rand(100),np.random.rand(100),np.random.rand(100)
    x2,y2,y2_err=np.random.rand(200),np.random.rand(200),np.random.rand(200)

I want to put x1,y1,y1_err,x2,y2,y2_err for all the systems into an h5 file in a structured manner.

Sorry, this might be a very elementary task, but I am really struggling.

Answer:

I think this should work:

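import numpy as np
import pandas as pd

Systems=np.arange(10000)
frames = []  # collect one DataFrame per system and concatenate once at the end

for i, sys in enumerate(Systems):
    x1,y1,y1_err=np.random.rand(100),np.random.rand(100),np.random.rand(100)
    x2,y2,y2_err=np.random.rand(200),np.random.rand(200),np.random.rand(200)
    # the arrays have different lengths, so the shorter x1/y1/y1_err columns are padded with NaN
    temp = (pd.DataFrame([x1,y1,y1_err,x2,y2,y2_err], index=['x1','y1','y1_err','x2','y2','y2_err'])).transpose()
    temp.insert(0, "system", i)  # keep 'system' as the first column
    frames.append(temp)

# a single concat avoids re-copying the growing DataFrame on every iteration
df = pd.concat(frames)
df.to_hdf('data.h5', key='key')  # pandas needs the PyTables package for HDF5 output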
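To get one system back out of a file written this way, pandas can filter on disk if the file is stored in table format. The following is a minimal sketch, not part of the original answer: it assumes PyTables is installed, uses only 3 systems with 4 rows each to stay small, and writes to a temporary file whose name is made up for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small illustrative table: 3 systems, 4 rows each (sizes are assumptions).
frames = []
for i in range(3):
    frames.append(pd.DataFrame({"system": i,
                                "x1": np.random.rand(4),
                                "y1": np.random.rand(4)}))
df = pd.concat(frames, ignore_index=True)

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo_pandas.h5")

# format="table" with data_columns makes the "system" column queryable on disk.
df.to_hdf(path, key="key", format="table", data_columns=["system"])

# Read back only the rows that belong to system 1.
one_system = pd.read_hdf(path, key="key", where="system == 1")
print(one_system.shape)
```

The default fixed format is faster to write but does not support where= queries, so the table format is the trade-off if per-system reads matter.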

Answer:

Two other methods to create HDF5 files are the h5py and PyTables packages. They are similar but each has unique strengths. The thing I like about both: when you open the HDF5 file with HDFView, you can view the data in a simple table layout (like a spreadsheet).

I wrote an example for each. Only 2 calls differ: 1) creating groups with create_group(), and 2) creating datasets with h5py create_dataset() vs PyTables create_table(). Both use a NumPy structured array to name the data columns (i.e., x1, y1, y1_err). The process is slightly simpler if you don't want to name the columns and all the data is the same type (e.g., all floats or all ints).

Here is the process for h5py:

import h5py
import numpy as np

table1_dt = np.dtype([('x1',float), ('y1',float), ('y1_err',float),])
table2_dt = np.dtype([('x2',float), ('y2',float), ('y2_err',float),])

Systems=np.arange(10_000)

with h5py.File('SO_71335363.h5','w') as h5f:
    
    for sys in Systems:
        grp = h5f.create_group(f'System_{sys:05}')
        x1,y1,y1_err=np.random.rand(100),np.random.rand(100),np.random.rand(100)
        t1_arr = np.empty(dtype=table1_dt,shape=(x1.shape[0],))
        t1_arr['x1'] = x1
        t1_arr['y1'] = y1
        t1_arr['y1_err'] = y1_err       
        grp.create_dataset('table1',data=t1_arr)
        
        x2,y2,y2_err=np.random.rand(200),np.random.rand(200),np.random.rand(200)
        t2_arr = np.empty(dtype=table2_dt,shape=(x2.shape[0],))
        t2_arr['x2'] = x2
        t2_arr['y2'] = y2
        t2_arr['y2_err'] = y2_err       
        grp.create_dataset('table2',data=t2_arr)
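Reading the data back follows the same group/dataset layout. Here is a minimal sketch under a few assumptions: it writes only 3 systems to a temporary file (the file name is made up) so it runs quickly, then lists the groups and pulls one named column out of one table.

```python
import os
import tempfile

import h5py
import numpy as np

table1_dt = np.dtype([("x1", float), ("y1", float), ("y1_err", float)])

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo_h5py.h5")

# Write a reduced version of the layout above: one group per system.
with h5py.File(path, "w") as h5f:
    for sys_id in range(3):
        grp = h5f.create_group(f"System_{sys_id:05}")
        t1_arr = np.empty(shape=(100,), dtype=table1_dt)
        t1_arr["x1"] = np.random.rand(100)
        t1_arr["y1"] = np.random.rand(100)
        t1_arr["y1_err"] = np.random.rand(100)
        grp.create_dataset("table1", data=t1_arr)

# Read it back: list the groups, then load one table and select a column by name.
with h5py.File(path, "r") as h5f:
    group_names = list(h5f.keys())
    table1 = h5f["System_00001/table1"][()]  # [()] loads the whole dataset into memory
    x1 = table1["x1"]

print(group_names)
print(table1.dtype.names, x1.shape)
```

Because each system lives in its own group, a single system can be loaded without touching the other 9,999.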

Here is the same procedure with PyTables (the package is imported as tables):

import tables as tb # (this is PyTables)
import numpy as np

table1_dt = np.dtype([('x1',float), ('y1',float), ('y1_err',float),])
table2_dt = np.dtype([('x2',float), ('y2',float), ('y2_err',float),])

Systems=np.arange(10_000)

with tb.open_file('SO_71335363_tb.h5','w') as h5f:
    
    for sys in Systems:
        grp = h5f.create_group('/',f'System_{sys:05}')
        x1,y1,y1_err=np.random.rand(100),np.random.rand(100),np.random.rand(100)
        t1_arr = np.empty(dtype=table1_dt,shape=(x1.shape[0],))
        t1_arr['x1'] = x1
        t1_arr['y1'] = y1
        t1_arr['y1_err'] = y1_err       
        h5f.create_table(grp,'table1',obj=t1_arr)
        
        x2,y2,y2_err=np.random.rand(200),np.random.rand(200),np.random.rand(200)
        t2_arr = np.empty(dtype=table2_dt,shape=(x2.shape[0],))
        t2_arr['x2'] = x2
        t2_arr['y2'] = y2
        t2_arr['y2_err'] = y2_err       
        h5f.create_table(grp,'table2',obj=t2_arr)
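For the simpler case mentioned earlier, where the columns are unnamed and all values share one type, a plain 2D float array can replace the structured array. This is a sketch with h5py, not the original author's code: the 3-system loop, the temporary file name, and the "columns" attribute used to record the column order are all assumptions made for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo_plain.h5")

with h5py.File(path, "w") as h5f:
    for sys_id in range(3):
        grp = h5f.create_group(f"System_{sys_id:05}")
        x1, y1, y1_err = np.random.rand(100), np.random.rand(100), np.random.rand(100)
        # Stack the three 1D arrays as columns of one (100, 3) float array.
        ds = grp.create_dataset("table1", data=np.column_stack((x1, y1, y1_err)))
        # The column names are no longer in the dtype, so note the order in an attribute.
        ds.attrs["columns"] = ["x1", "y1", "y1_err"]

with h5py.File(path, "r") as h5f:
    ds = h5f["System_00002/table1"]
    data = ds[()]                        # whole (100, 3) array
    columns = list(ds.attrs["columns"])  # column order saved at write time
    y1_back = data[:, 1]                 # second column holds y1

print(data.shape, len(columns))
```

The cost of this layout is that columns are addressed by position rather than by name, which is why the attribute is worth writing.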