Home > Software engineering >  Create Event Indicator from 2 DataFrames of Timestamps
Create Event Indicator from 2 DataFrames of Timestamps

Time:03-11

I have two dataframes, one is the main_df that has high frequency pi data with timestamps. The other is the reference_df that is a header-level dataframe with start and end timestamps. I need to create an indicator for main_df for when individual rows fall between the start and end timestamps in the referenc_df. Any input would be appreciated.

import pandas as pd
 
#Proxy reference dataframe
reference_data = [['site a', '2021-03-05 00:00:00', '2021-03-05 23:52:00'], 
        ['site a', '2021-03-06 00:00:00', '2021-03-06 12:00:00'], 
        ['site b', '2021-04-08 20:04:00', '2021-04-08 23:00:00'],
        ['site c', '2021-04-09 04:08:00', '2021-04-09 09:52:00']] 
 
ref_df = pd.DataFrame(reference_data, columns = ['id', 'start', 'end'])
ref_df['start'] = pd.to_datetime(ref_df['start'], infer_datetime_format=True)
ref_df['end'] = pd.to_datetime(ref_df['end'], infer_datetime_format=True)


#Proxy main high frequency dataframe
main_data = [['site a', '2021-03-05 01:00:00', 10], 
        ['site a', '2021-03-05 01:01:00', 11], 
        ['site b', '2021-04-08 20:00:00', 9],
        ['site b', '2021-04-08 20:04:00', 10],
        ['site b', '2021-04-08 20:05:00', 11],
        ['site c', '2021-01-09 10:0:00', 7]] 
 
# Create the pandas DataFrame
main_df = pd.DataFrame(main_data, columns = ['id', 'timestamp', 'value'])
main_df['timestamp'] = pd.to_datetime(main_df['timestamp'], infer_datetime_format=True)

Desired DataFrame:

print(main_df)
    id      timestamp           value event_indicator
0   site a  2021-03-05 01:00:00 10    1
1   site a  2021-03-05 01:01:00 11    1
2   site b  2021-04-08 20:00:00 9     0
3   site b  2021-04-08 20:04:00 10    1
4   site b  2021-04-08 20:05:00 11    1
5   site c  2021-01-09 10:00:00 7     0

CodePudding user response:

Perform an inner join on the sites, then check if the timestamp is between any of the ranges. reset_index before the merge so you can use that to keep track of which row you're checking the range for.

s = main_df.reset_index().merge(ref_df, on='id')
s['event_indicator'] = s['timestamp'].between(s['start'], s['end']).astype(int)
# Max checks for at least 1 overlap. 
s = s.groupby('index')['event_indicator'].max()

main_df['event_indicator'] = s

       id           timestamp  value  event_indicator
0  site a 2021-03-05 01:00:00     10                1
1  site a 2021-03-05 01:01:00     11                1
2  site b 2021-04-08 20:00:00      9                0
3  site b 2021-04-08 20:04:00     10                1
4  site b 2021-04-08 20:05:00     11                1
5  site c 2021-01-09 10:00:00      7                0
  • Related