I am sampling time series data off various machines, and every so often need to collect a large high frequency burst of data from another device and append it to the time series data.
Imagine I am measuring temperature over time, and then every 10 degrees increase in temperature I sample a micro at 200khz, I want to be able to tag the large burst of micro data to a timestamp in the time-series data. Maybe even in the form of a figure.
I was trying to do this with regionref
, but am struggling to find a elegant solution. and I'm finding myself juggling between pandas store and h5py and it just feels messy.
Initially I thought I would be able to make separate datasets from the burst-data then use reference or links to timestamps in the time-series data. But no luck so far.
Any way to reference a large packet of data to a timestamp in another pile of data would be appreciated!
CodePudding user response:
How did use region references? I assume you had an array of references, with references alternating between a range of "standard rate" and "burst rate" data. That is a valid approach, and it will work. However, you are correct: it's messy to create, and messy to recover the data.
Virtual Datasets might be a more elegant solution....but tracking and creating the virtual layout definitions could get messy too. :-) However, once you have the virtual data set, you can read it with typical slice notation. HDF5/h5py handles everything under the covers.
To demonstrate, I created a "simple" example (realizing virtual datasets aren't "simple"). That said, if you can figure out region references, you can figure out virtual datasets. Here is a link to the h5py Virtual Dataset Documentation and Example for details. Here is a short summary of the process:
- Define the virtual layout: this is the shape and dtype of the virtual dataset that will point to other datasets.
- Define the virtual sources. Each is a reference to a HDF5 file and dataset (1 virtual source for file/dataset combination.)
- Map virtual source data to the virtual layout (you can use slice notation, which is shown in my example).
- Repeat steps 2 and 3 for all sources (or slices of sources)
Note: virtual datasets can be in a separate file, or in the same file as the referenced datasets. I will show both in the example. (Once you have defined the layout and sources, both methods are equally easy.)
There are at least 3 other SO questions and answers on this topic:
- h5py, enums, and VirtualLayout
- h5py error reading virtual dataset into NumPy array
- How to combine multiple hdf5 files into one file and dataset?
Example follows:
Step 1: Create some example data. Without your schema, I guessed at how you stored "standard rate" and "burst rate" data. All standard rate data is stored in dataset 'data_log'
and each burst is stored in a separate dataset named: 'burst_log_##'
.
import numpy as np
import h5py
log_ntimes = 31
log_inc = 1e-3
arr = np.zeros((log_ntimes,2))
for i in range(log_ntimes):
time = i*log_inc
arr[i,0] = time
#temp = 70. 100.*time
#print(f'For Time = {time:.5f} ; Temp= {temp:.4f}')
arr[:,1] = 70. 100.*arr[:,0]
#print(arr)
with h5py.File('SO_72654160.h5','w') as h5f:
h5f.create_dataset('data_log',data=arr)
n_bursts = 4
burst_ntimes = 11
burst_inc = 5e-5
for n in range(1,n_bursts):
arr = np.zeros((burst_ntimes-1,2))
for i in range(1,burst_ntimes):
burst_time = 0.01*(n)
time = burst_time i*burst_inc
arr[i-1,0] = time
#temp = 70. 100.*t
arr[:,1] = 70. 100.*arr[:,0]
with h5py.File('SO_72654160.h5','a') as h5f:
h5f.create_dataset(f'burst_log_{n:02}',data=arr)
Step 2: This is where the virtual layout and sources are defined and used to create the virtual dataset. This creates a virtual dataset a new file, and one in the existing file. (The statements are identical except for the file name and mode.)
source_file = 'SO_72654160.h5'
a0 = 0
with h5py.File(source_file, 'r') as h5f:
for ds_name in h5f:
a0 = h5f[ds_name].shape[0]
print(f'Total data rows in source = {a0}')
# alternate getting data from
# dataset: data_log, get rows 0-11, 11-21, 21-31
# datasets: burst_log_01, burst log_02, etc (each has 10 rows)
# Define virtual dataset layout
layout = h5py.VirtualLayout(shape=(a0, 2),dtype=float)
# Map virstual dataset to logged data
vsource1 = h5py.VirtualSource(source_file, 'data_log', shape=(41,2))
layout[0:11,:] = vsource1[0:11,:]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_01', shape=(10,2))
layout[11:21,:] = vsource2
layout[21:31,:] = vsource1[11:21,:]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_02', shape=(10,2))
layout[31:41,:] = vsource2
layout[41:51,:] = vsource1[21:31,:]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_03', shape=(10,2))
layout[51:61,:] = vsource2
# Create NEW file, then add virtual dataset
with h5py.File('SO_72654160_VDS.h5', 'w') as h5vds:
h5vds.create_virtual_dataset("vdata", layout)
print(f'Total data rows in VDS 1 = {h5vds["vdata"].shape[0]}')
# Open EXISTING file, then add virtual dataset
with h5py.File('SO_72654160.h5', 'a') as h5vds:
h5vds.create_virtual_dataset("vdata", layout)
print(f'Total data rows in VDS 2 = {h5vds["vdata"].shape[0]}')