I have two dataframes, one is the main_df
that has high frequency pi data with timestamps. The other is the reference_df
that is a header-level dataframe with start and end timestamps. I need to create an indicator for main_df
for when individual rows fall between the start and end timestamps in the referenc_df
. Any input would be appreciated.
import pandas as pd
#Proxy reference dataframe
reference_data = [['site a', '2021-03-05 00:00:00', '2021-03-05 23:52:00'],
['site a', '2021-03-06 00:00:00', '2021-03-06 12:00:00'],
['site b', '2021-04-08 20:04:00', '2021-04-08 23:00:00'],
['site c', '2021-04-09 04:08:00', '2021-04-09 09:52:00']]
ref_df = pd.DataFrame(reference_data, columns = ['id', 'start', 'end'])
ref_df['start'] = pd.to_datetime(ref_df['start'], infer_datetime_format=True)
ref_df['end'] = pd.to_datetime(ref_df['end'], infer_datetime_format=True)
#Proxy main high frequency dataframe
main_data = [['site a', '2021-03-05 01:00:00', 10],
['site a', '2021-03-05 01:01:00', 11],
['site b', '2021-04-08 20:00:00', 9],
['site b', '2021-04-08 20:04:00', 10],
['site b', '2021-04-08 20:05:00', 11],
['site c', '2021-01-09 10:0:00', 7]]
# Create the pandas DataFrame
main_df = pd.DataFrame(main_data, columns = ['id', 'timestamp', 'value'])
main_df['timestamp'] = pd.to_datetime(main_df['timestamp'], infer_datetime_format=True)
Desired DataFrame:
print(main_df)
id timestamp value event_indicator
0 site a 2021-03-05 01:00:00 10 1
1 site a 2021-03-05 01:01:00 11 1
2 site b 2021-04-08 20:00:00 9 0
3 site b 2021-04-08 20:04:00 10 1
4 site b 2021-04-08 20:05:00 11 1
5 site c 2021-01-09 10:00:00 7 0
CodePudding user response:
Perform an inner join on the sites, then check if the timestamp is between any of the ranges. reset_index
before the merge so you can use that to keep track of which row you're checking the range for.
s = main_df.reset_index().merge(ref_df, on='id')
s['event_indicator'] = s['timestamp'].between(s['start'], s['end']).astype(int)
# Max checks for at least 1 overlap.
s = s.groupby('index')['event_indicator'].max()
main_df['event_indicator'] = s
id timestamp value event_indicator
0 site a 2021-03-05 01:00:00 10 1
1 site a 2021-03-05 01:01:00 11 1
2 site b 2021-04-08 20:00:00 9 0
3 site b 2021-04-08 20:04:00 10 1
4 site b 2021-04-08 20:05:00 11 1
5 site c 2021-01-09 10:00:00 7 0