Home > Enterprise >  Transformation of time series data arranged as value per column into Pandas 2D Dataframe
Transformation of time series data arranged as value per column into Pandas 2D Dataframe

Time:07-26

I have data where 13000 high frequency time-series data points are mapped as one column per point assosiated with one timestamp, when the data was measured. I read this data from an influxDB and get a Pandas.DataFrame structured as follows:

acquisition time stamp (DateTimeIndex), datapoint1 (int), ..., datapointN (int)

I know the sampling frequency of the high frequency data (10 MHz), so each datapoint sample has a length of 100 nanoseconds. With the given time stamp of the acquisition, I can calculate back to give each of the data points a time stamp, spaced 100 nanoseconds.

In total I have hundreds of tousands of rows like above which I would like to convert in a nanosecond-precision time series which should look like follows:

time stamp 1 (DateTimeIndex), acquisition stamp 1, datapoint1,1
time stamp 2 (DateTimeIndex), acquisition stamp 1, datapoint1,2
...
time stamp N (DateTimeIndex), acquisition stamp 1, datapoint1,N
time stamp N   1 (DateTimeIndex), acquisition stamp 2, datapoint2,1
time stamp N   2 (DateTimeIndex), acquisition stamp 2, datapoint2,2
...
time stamp 2*N (DateTimeIndex), acquisition stamp 2, datapoint2,N
...
...
time stamp M*N (DateTimeIndex), acquisition stamp M, datapoint2,N

Where N is the number of data points (= number of columns in the original frame = 13000) an M is the number of rows (= individial measurement data sets, each with 13000 data points)

Currently, my code looks as follows:

out_data = DataFrame()
out_data = data_from_influx.apply(lambda item: processRawdataColumns(item, sampling_frequency, out_data))

with

def processRawdataColumns(raw_data: Series, sampling_frequency: int, result_frame: DataFrame) -> DataFrame:
    """
    Method transforming raw data organized as indexed array into timeseries data, assuming that the initial
    acquisition timestamp equals the time stamp of the last sample in the series
    :param raw_data: initial data, expected is a pandas.Series where Series.name = acquisition timestamp
    :param sampling_frequency: the sampling frequency of time-resolved raw data
    :return: pandas.DataFrame with time stamped raw data
    """

    try:
        acq_timestamp = raw_data.name.value
        processed_data: DataFrame = raw_data.to_frame()
        processed_data = processed_data.reset_index()
        processed_data['index'] = processed_data['index'].apply(lambda item: int(item.replace('rawdata', '').lstrip(
            '0')))
        processed_data['acqtimestamp'] = raw_data.name
        processed_data['time'] = processed_data['index'] \
            .apply(lambda index: acq_timestamp - int((len(raw_data) - index - 1) * 1E9 / sampling_frequency))
        processed_data = processed_data.drop(columns=['index'])
        processed_data['time'] = pd.to_datetime(processed_data['time'])
        processed_data = processed_data.rename(columns={raw_data.name: 'rawdata'})
        processed_data = processed_data.set_index('time')
        result_frame = result_frame.append(processed_data)
        return result_frame
    except Exception as err:
        print(err)
        return DataFrame(index=['time'])

However, the data is not properly structured. I get an output_frame structured by M rows, each with one DataFrame containing a properly transformed high-frequency data time-series. How can I get to the "flat" structure of time indexed rows with one value and a time stamp?

As sample data it is enough to consider data like this (sample_frequency = 1E7):

                       acq_stamp  rawdata000001  rawdata000002  rawdata000003  rawdata000004
0  2022-05-15T21:00:02.660160000              1              2              3              4
1  2022-05-15T21:00:04.660160000              5              6              7              8

This should become

                      time stamp                      acq_stamp  value
0  2022-05-15T21:00:02.660159700  2022-05-15T21:00:02.660160000      1
1  2022-05-15T21:00:02.660159800  2022-05-15T21:00:02.660160000      2
2  2022-05-15T21:00:02.660159900  2022-05-15T21:00:02.660160000      3
3  2022-05-15T21:00:02.660160000  2022-05-15T21:00:02.660160000      4
4  2022-05-15T21:00:04.660159700  2022-05-15T21:00:04.660160000      5
5  2022-05-15T21:00:04.660159800  2022-05-15T21:00:04.660160000      6
6  2022-05-15T21:00:04.660159900  2022-05-15T21:00:04.660160000      7
7  2022-05-15T21:00:04.660160000  2022-05-15T21:00:04.660160000      8

CodePudding user response:

I'm not sure I'm fully understanding the rebasing logic, but this does lead to from your sample input to your sample output.

# Convert to pandas datetime.
df.acq_stamp = pd.to_datetime(df.acq_stamp)

# Melt your dataframe, keeping the acq_stamp column.
df = df.melt('acq_stamp', var_name='rawdata')

# Get the numerical value out of the rawdata column name.
df.rawdata = df.rawdata.str.extract('(\d )').astype(int)

# Find the absolute difference between each rawdata point and the max rawdata point,
# Make this into units of 100ns and subtract it from the acq_stamp.
rawdiff_as_ns = df.rawdata.sub(df.rawdata.max()).abs().mul(100)
df['timestamp'] = df.acq_stamp.sub(pd.to_timedelta(rawdiff_as_ns, unit='ns'))

# Sort your data.
df = df.sort_values('timestamp', ignore_index=True)

# Outputting just the columns you wanted:
print(df[['timestamp', 'acq_stamp', 'value']])

Output:

                      timestamp                  acq_stamp  value
0 2022-05-15 21:00:02.660159700 2022-05-15 21:00:02.660160      1
1 2022-05-15 21:00:02.660159800 2022-05-15 21:00:02.660160      2
2 2022-05-15 21:00:02.660159900 2022-05-15 21:00:02.660160      3
3 2022-05-15 21:00:02.660160000 2022-05-15 21:00:02.660160      4
4 2022-05-15 21:00:04.660159700 2022-05-15 21:00:04.660160      5
5 2022-05-15 21:00:04.660159800 2022-05-15 21:00:04.660160      6
6 2022-05-15 21:00:04.660159900 2022-05-15 21:00:04.660160      7
7 2022-05-15 21:00:04.660160000 2022-05-15 21:00:04.660160      8
  • Related