I have data where 13000 high-frequency time-series data points are mapped as one column per point, associated with one timestamp indicating when the data was measured. I read this data from an InfluxDB and get a pandas.DataFrame structured as follows:
acquisition time stamp (DateTimeIndex), datapoint1 (int), ..., datapointN (int)
I know the sampling frequency of the high-frequency data (10 MHz), so each sample spans 100 nanoseconds. With the given acquisition timestamp, I can calculate back to give each of the data points its own timestamp, spaced 100 nanoseconds apart.
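Concretely, for a single acquisition the back-calculation I have in mind looks like this (a sketch; acq_ts and n_samples are placeholder names, and the acquisition timestamp is taken as the time of the last sample):
import pandas as pd

sampling_frequency = 10_000_000   # 10 MHz -> one sample every 100 ns
n_samples = 13000
acq_ts = pd.Timestamp('2022-05-15T21:00:02.660160000')

# sample i lies (n_samples - 1 - i) sampling periods before the acquisition timestamp
offsets_ns = (n_samples - 1 - pd.RangeIndex(n_samples)) * int(1e9 / sampling_frequency)
sample_times = acq_ts - pd.to_timedelta(offsets_ns, unit='ns')  # 100 ns spacing, ending at acq_ts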
In total I have hundreds of thousands of rows like the above, which I would like to convert into a nanosecond-precision time series that should look as follows:
time stamp 1 (DateTimeIndex), acquisition stamp 1, datapoint1,1
time stamp 2 (DateTimeIndex), acquisition stamp 1, datapoint1,2
...
time stamp N (DateTimeIndex), acquisition stamp 1, datapoint1,N
time stamp N+1 (DateTimeIndex), acquisition stamp 2, datapoint2,1
time stamp N+2 (DateTimeIndex), acquisition stamp 2, datapoint2,2
...
time stamp 2*N (DateTimeIndex), acquisition stamp 2, datapoint2,N
...
...
time stamp M*N (DateTimeIndex), acquisition stamp M, datapointM,N
Where N is the number of data points (= number of columns in the original frame = 13000) and M is the number of rows (= individual measurement data sets, each with 13000 data points).
Currently, my code looks as follows:
import pandas as pd
from pandas import DataFrame, Series

out_data = DataFrame()
out_data = data_from_influx.apply(lambda item: processRawdataColumns(item, sampling_frequency, out_data))
with
def processRawdataColumns(raw_data: Series, sampling_frequency: int, result_frame: DataFrame) -> DataFrame:
    """
    Method transforming raw data organized as an indexed array into time-series data, assuming that the
    initial acquisition timestamp equals the time stamp of the last sample in the series
    :param raw_data: initial data, expected is a pandas.Series where Series.name = acquisition timestamp
    :param sampling_frequency: the sampling frequency of the time-resolved raw data
    :param result_frame: frame that the processed time series is appended to
    :return: pandas.DataFrame with time-stamped raw data
    """
    try:
        acq_timestamp = raw_data.name.value  # acquisition time stamp in ns since epoch
        processed_data: DataFrame = raw_data.to_frame()
        processed_data = processed_data.reset_index()
        # 'rawdata000001' -> 1, etc.
        processed_data['index'] = processed_data['index'].apply(
            lambda item: int(item.replace('rawdata', '').lstrip('0')))
        processed_data['acqtimestamp'] = raw_data.name
        # each sample lies (len(raw_data) - index - 1) sampling periods before the acquisition
        processed_data['time'] = processed_data['index'] \
            .apply(lambda index: acq_timestamp - int((len(raw_data) - index - 1) * 1E9 / sampling_frequency))
        processed_data = processed_data.drop(columns=['index'])
        processed_data['time'] = pd.to_datetime(processed_data['time'])
        processed_data = processed_data.rename(columns={raw_data.name: 'rawdata'})
        processed_data = processed_data.set_index('time')
        result_frame = result_frame.append(processed_data)
        return result_frame
    except Exception as err:
        print(err)
        return DataFrame(index=['time'])
However, the data is not properly structured. I get an out_data frame with M rows, each holding one DataFrame that contains a properly transformed high-frequency time series. How can I get to the "flat" structure of time-indexed rows with one value and a time stamp each?
As sample data it is enough to consider data like this (sampling_frequency = 1E7):
acq_stamp rawdata000001 rawdata000002 rawdata000003 rawdata000004
0 2022-05-15T21:00:02.660160000 1 2 3 4
1 2022-05-15T21:00:04.660160000 5 6 7 8
This should become:
time stamp acq_stamp value
0 2022-05-15T21:00:02.660159700 2022-05-15T21:00:02.660160000 1
1 2022-05-15T21:00:02.660159800 2022-05-15T21:00:02.660160000 2
2 2022-05-15T21:00:02.660159900 2022-05-15T21:00:02.660160000 3
3 2022-05-15T21:00:02.660160000 2022-05-15T21:00:02.660160000 4
4 2022-05-15T21:00:04.660159700 2022-05-15T21:00:04.660160000 5
5 2022-05-15T21:00:04.660159800 2022-05-15T21:00:04.660160000 6
6 2022-05-15T21:00:04.660159900 2022-05-15T21:00:04.660160000 7
7 2022-05-15T21:00:04.660160000 2022-05-15T21:00:04.660160000 8
CodePudding user response:
I'm not sure I'm fully understanding the rebasing logic, but this does get from your sample input to your sample output.
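For reference, here is a minimal way to reconstruct your sample input so the snippet below runs as-is (frame and column names follow your example):
import pandas as pd

df = pd.DataFrame({
    'acq_stamp': ['2022-05-15T21:00:02.660160000', '2022-05-15T21:00:04.660160000'],
    'rawdata000001': [1, 5],
    'rawdata000002': [2, 6],
    'rawdata000003': [3, 7],
    'rawdata000004': [4, 8],
})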
# Convert to pandas datetime.
df.acq_stamp = pd.to_datetime(df.acq_stamp)
# Melt your dataframe, keeping the acq_stamp column.
df = df.melt('acq_stamp', var_name='rawdata')
# Get the numerical value out of the rawdata column name.
df.rawdata = df.rawdata.str.extract(r'(\d+)', expand=False).astype(int)
# Find the absolute difference between each rawdata point and the max rawdata point,
# convert that sample count to nanoseconds (100 ns per sample) and subtract it from the acq_stamp.
rawdiff_as_ns = df.rawdata.sub(df.rawdata.max()).abs().mul(100)
df['timestamp'] = df.acq_stamp.sub(pd.to_timedelta(rawdiff_as_ns, unit='ns'))
# Sort your data.
df = df.sort_values('timestamp', ignore_index=True)
# Outputting just the columns you wanted:
print(df[['timestamp', 'acq_stamp', 'value']])
Output:
timestamp acq_stamp value
0 2022-05-15 21:00:02.660159700 2022-05-15 21:00:02.660160 1
1 2022-05-15 21:00:02.660159800 2022-05-15 21:00:02.660160 2
2 2022-05-15 21:00:02.660159900 2022-05-15 21:00:02.660160 3
3 2022-05-15 21:00:02.660160000 2022-05-15 21:00:02.660160 4
4 2022-05-15 21:00:04.660159700 2022-05-15 21:00:04.660160 5
5 2022-05-15 21:00:04.660159800 2022-05-15 21:00:04.660160 6
6 2022-05-15 21:00:04.660159900 2022-05-15 21:00:04.660160 7
7 2022-05-15 21:00:04.660160000 2022-05-15 21:00:04.660160 8
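A side note on your original approach: apply collects whatever the function returns, so returning a whole DataFrame per row is exactly what produces the nested structure you observed. If you want to keep your per-row function, collect the per-acquisition frames and concatenate them once at the end (a sketch reusing your processRawdataColumns; note that DataFrame.append, which it relies on, was removed in pandas 2.0, so pd.concat is also the forward-compatible choice):
flat = pd.concat(
    processRawdataColumns(row, sampling_frequency, pd.DataFrame())
    for _, row in data_from_influx.iterrows()
)
For hundreds of thousands of rows, though, the vectorized melt approach above should be considerably faster than any row-wise apply or iterrows loop.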