How to use 1D arrays from an HDF5 file and perform operations such as subtraction, addition, etc.

I have 1D arrays which look like this:

array([(b'2P1', b'aP1', 2, 37.33,  4.4 , 3.82),
   (b'3P2', b'aP2', 3, 18.74, -9.67, 4.85),
   (b'4P2', b'aP2', 4, 55.16, 74.22, 4.88)],

As you can see, the rows mix strings with numbers. I cannot operate on them element-wise: for example, if I want to subtract the first row from the second row using only the floating-point columns, I can't. Is there any way around this? Here is the link for the HDF5 data file. Here is the code for reading the HDF5 file:

import numpy as np
import h5py

with h5py.File('xaa.h5', 'r') as hdff:
    base_items = list(hdff.items())
    print('Items in the base directory: ', base_items)
    dat1 = np.array(hdff['particles/lipids/positions/dataset_0001'])
    dat2 = np.array(hdff['particles/lipids/positions/dataset_0002'])
    print(dat1)

CodePudding user response：

In [188]: f = h5py.File('../Downloads/xaa.h5')
In [189]: f
Out[189]: <HDF5 file "xaa.h5" (mode r)>
...
In [194]: f['particles/lipids/positions'].keys()
Out[194]: <KeysViewHDF5 ['dataset_0000', 'dataset_0001', 'dataset_0002', 'dataset_0003', 'dataset_0004', 'dataset_0005', 'dataset_0006', 'dataset_0007', 'dataset_0008', 'dataset_0009']>
...
In [196]: f['particles/lipids/positions/dataset_0000'].dtype
Out[196]: dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', '<i8'), ('col4', '<f8'), ('col5', '<f8'), ('col6', '<f8')])

As I suspected, this is a structured array: https://numpy.org/doc/stable/user/basics.rec.html

In [202]: arr[0]
Out[202]: (b'1P1', b'aP1', 1, 80.48, 35.36, 4.25)
In [203]: arr['col1'][:10]
Out[203]: 
array([b'1P1', b'2P1', b'3P2', b'4P2', b'5P3', b'6P3', b'7P4', b'8P4',
       b'9P5', b'10P5'], dtype='|S7')

We can get a view of the float columns with:

In [204]: arr[['col4','col5','col6']][:10]
Out[204]: 
array([(80.48,  35.36, 4.25), (37.45,   3.92, 3.96),
       (18.53,  -9.69, 4.68), (55.39,  74.34, 4.6 ),
       (22.11,  68.71, 3.85), (-4.13,  24.04, 3.73),
       (40.16,   6.39, 4.73), (-5.4 ,  35.73, 4.85),
       (36.67,  22.45, 4.08), (-3.68, -10.66, 4.18)],
      dtype={'names':['col4','col5','col6'], 'formats':['<f8','<f8','<f8'], 'offsets':[23,31,39], 'itemsize':47})
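
To illustrate why the multi-field view alone does not solve the problem, here is a small sketch. It does not read the real file: `arr` is a hand-typed stand-in with the dtype reported above and three of the rows shown, so those values are an assumption for illustration. A single field behaves like an ordinary float array, while a multi-field selection is still structured and rejects arithmetic.

```python
import numpy as np

# Hypothetical stand-in for the dataset, using the dtype reported above
dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', '<i8'),
               ('col4', '<f8'), ('col5', '<f8'), ('col6', '<f8')])
arr = np.array([(b'1P1', b'aP1', 1, 80.48, 35.36, 4.25),
                (b'2P1', b'aP1', 2, 37.45, 3.92, 3.96),
                (b'3P2', b'aP2', 3, 18.53, -9.69, 4.68)], dtype=dt)

# A single field is a plain 1D float array, so arithmetic works directly
dx = arr['col4'][1] - arr['col4'][0]      # second row minus first row, col4 only
col_sum = arr['col4'] + arr['col5']       # element-wise sum of two fields

# A multi-field selection is still a structured array; NumPy has no
# subtraction defined for structured dtypes, so this raises TypeError
sub = arr[['col4', 'col5', 'col6']]
try:
    sub[1:2] - sub[0:1]
    structured_subtraction_ok = True
except TypeError:
    structured_subtraction_ok = False

print(dx, col_sum, structured_subtraction_ok)
```

So field-by-field arithmetic is already possible; the conversion step is only needed when you want to treat several fields as one numeric block.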

But to treat those fields as a 2D array, we need a recfunctions utility:

In [198]: import numpy.lib.recfunctions as rf

In [205]: rf.structured_to_unstructured( arr[['col4','col5','col6']][:10])
Out[205]: 
array([[ 80.48,  35.36,   4.25],
       [ 37.45,   3.92,   3.96],
       [ 18.53,  -9.69,   4.68],
       [ 55.39,  74.34,   4.6 ],
       [ 22.11,  68.71,   3.85],
       [ -4.13,  24.04,   3.73],
       [ 40.16,   6.39,   4.73],
       [ -5.4 ,  35.73,   4.85],
       [ 36.67,  22.45,   4.08],
       [ -3.68, -10.66,   4.18]])
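
Once the float fields are in a plain 2D array, the row subtraction asked about in the question is ordinary NumPy arithmetic. The sketch below assumes a hand-typed stand-in for `arr` (same dtype, first three rows as shown above) rather than the real file; the names `xyz`, `diff`, and `steps` are only illustrative.

```python
import numpy as np
import numpy.lib.recfunctions as rf

# Hypothetical stand-in for the dataset, using the dtype reported above
dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', '<i8'),
               ('col4', '<f8'), ('col5', '<f8'), ('col6', '<f8')])
arr = np.array([(b'1P1', b'aP1', 1, 80.48, 35.36, 4.25),
                (b'2P1', b'aP1', 2, 37.45, 3.92, 3.96),
                (b'3P2', b'aP2', 3, 18.53, -9.69, 4.68)], dtype=dt)

# Convert the three float fields into an (nrows, 3) float64 array
xyz = rf.structured_to_unstructured(arr[['col4', 'col5', 'col6']])

# Second row minus first row, over the float columns only
diff = xyz[1] - xyz[0]

# Differences between every pair of consecutive rows
steps = np.diff(xyz, axis=0)

print(diff)
print(steps)
```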

CodePudding user response：

The answer above is a good approach, but you don't have to use recfunctions. Once you know the dtype and shape of the dataset, you can create an empty array and populate it by reading the data of interest using field slice notation, as shown in the answer above.

Here is the code to do that. (Since we know you are reading 3 floats and float is the default dtype for np.empty(), I didn't bother getting the field dtypes from the dataset -- it would be easy to add if you need to slice integer or string fields.)

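import numpy as np
import h5py

with h5py.File('xaa.h5', 'r') as hdf:
    grp = hdf['particles/lipids/positions']
    ds1 = grp['dataset_0000']
    nrows = ds1.shape[0]
    # np.empty() defaults to float64, matching the '<f8' fields
    arr = np.empty((nrows, 3))
    # Read one field at a time into a column of the plain 2D array
    arr[:, 0] = ds1['col4'][:]
    arr[:, 1] = ds1['col5'][:]
    arr[:, 2] = ds1['col6'][:]

    print(arr[0:10, :])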

Output:

[[ 80.48  35.36   4.25]
 [ 37.45   3.92   3.96]
 [ 18.53  -9.69   4.68]
 [ 55.39  74.34   4.6 ]
 [ 22.11  68.71   3.85]
 [ -4.13  24.04   3.73]
 [ 40.16   6.39   4.73]
 [ -5.4   35.73   4.85]
 [ 36.67  22.45   4.08]
 [ -3.68 -10.66   4.18]]
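
With the float columns in plain arrays, subtracting one dataset from another is a one-liner. The sketch below wraps the fill-by-field idea in a hypothetical helper, `fields_to_2d` (not part of NumPy or h5py), and runs it on two small hand-typed structured arrays instead of the real file; the sample rows are for illustration only. The helper only relies on `ds.shape` and `ds[name]`, so it should also accept an h5py dataset opened as in the code above, though it is only exercised on NumPy arrays here.

```python
import numpy as np

def fields_to_2d(ds, names):
    """Copy the named numeric fields of ds into a plain 2D float array."""
    out = np.empty((ds.shape[0], len(names)))   # float64 by default
    for j, name in enumerate(names):
        out[:, j] = ds[name]                    # one field per column
    return out

# Hypothetical stand-ins for two datasets with the dtype reported above
dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', '<i8'),
               ('col4', '<f8'), ('col5', '<f8'), ('col6', '<f8')])
frame0 = np.array([(b'2P1', b'aP1', 2, 37.45, 3.92, 3.96),
                   (b'3P2', b'aP2', 3, 18.53, -9.69, 4.68)], dtype=dt)
frame1 = np.array([(b'2P1', b'aP1', 2, 37.33, 4.4, 3.82),
                   (b'3P2', b'aP2', 3, 18.74, -9.67, 4.85)], dtype=dt)

float_cols = ['col4', 'col5', 'col6']
pos0 = fields_to_2d(frame0, float_cols)
pos1 = fields_to_2d(frame1, float_cols)

# Row-by-row difference between the two datasets (rows must line up)
displacement = pos1 - pos0
print(displacement)
```

This assumes both datasets list the same rows in the same order; if that is not guaranteed, compare the `col1` or `col3` fields first.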