I need to create a dataframe removing over-lapping start
and end
datetimes for multiple ids
. I will use the start
and end
datetimes to aggregate values in a high frequency pandas dataframe, so I need to remove those overlapping datetimes in mst_df
.
import pandas as pd
#Proxy reference dataframe
master = [['site a', '2021-07-08 00:00:00', '2021-07-08 10:56:00'],
['site a', '2021-07-08 06:00:00', '2021-07-08 12:00:00'], #slightly overlapping
['site a', '2021-07-08 17:36:00', '2021-07-09 11:40:00'],
['site a', '2021-07-08 18:00:00', '2021-07-09 11:40:00'], #overlapping
['site a', '2021-07-09 00:00:00', '2021-07-09 05:40:00'], #overlapping
['site b', '2021-07-08 00:00:00', '2021-07-08 10:24:00'],
['site b', '2021-07-08 06:00:00', '2021-07-08 10:24:00'], #overlapping
['site b', '2021-07-08 17:32:00', '2021-07-09 11:12:00'],
['site b', '2021-07-08 18:00:00', '2021-07-09 11:12:00'], #overlapping
['site b', '2021-07-09 00:00:00', '2021-07-09 13:00:00']] #slightly overlapping
mst_df = pd.DataFrame(master, columns = ['id', 'start', 'end'])
mst_df['start'] = pd.to_datetime(mst_df['start'], infer_datetime_format=True)
mst_df['end'] = pd.to_datetime(mst_df['end'], infer_datetime_format=True)
Desired DataFrame:
id start end
site a 2021-07-08 00:00:00 2021-07-08 12:00:00
site a 2021-07-08 17:36:00 2021-07-09 11:40:00
site b 2021-07-08 00:00:00 2021-07-08 10:24:00
site b 2021-07-08 17:32:00 2021-07-09 13:00:00
CodePudding user response:
I don't know if pandas
has special function for this. It has Interval.overlaping() to check if two ranges overlaps (and it works evenwith datetime
) but I don't see function to merge these two ranges so it still needs own code for merging. Fortunately it is easy.
Rows are sorted by start
so rows are not overlaping when previous_end < next_start
and I use it in for
-loop.
But first I group by site
to work with every site separatelly.
Next I get first row (as previous
) and run loop with other rows (as next)
and check previous_end < next_start
.
If it is True
then I can put previous
on list of results and get next
as previous
to work with rest of rows.
If it is False
then I create new range from both rows and use it to work with rest of rows.
Finally I add previous
to list.
After processing all groups I convert all to DataFrame.
import pandas as pd
#Proxy reference dataframe
master = [
['site a', '2021-07-08 00:00:00', '2021-07-08 10:56:00'],
['site a', '2021-07-08 06:00:00', '2021-07-08 12:00:00'], # slightly overlapping
['site a', '2021-07-08 17:36:00', '2021-07-09 11:40:00'],
['site a', '2021-07-08 18:00:00', '2021-07-09 11:40:00'], # overlapping
['site a', '2021-07-09 00:00:00', '2021-07-09 05:40:00'], # overlapping
['site b', '2021-07-08 00:00:00', '2021-07-08 10:24:00'],
['site b', '2021-07-08 06:00:00', '2021-07-08 10:24:00'], # overlapping
['site b', '2021-07-08 17:32:00', '2021-07-09 11:12:00'],
['site b', '2021-07-08 18:00:00', '2021-07-09 11:12:00'], # overlapping
['site b', '2021-07-09 00:00:00', '2021-07-09 13:00:00'] # slightly overlapping
]
mst_df = pd.DataFrame(master, columns = ['id', 'start', 'end'])
mst_df['start'] = pd.to_datetime(mst_df['start'], infer_datetime_format=True)
mst_df['end'] = pd.to_datetime(mst_df['end'], infer_datetime_format=True)
result = []
for val, group in mst_df.groupby('id'):
# get first
prev = group.iloc[0]
for idx, item in group[1:].iterrows():
if prev['end'] < item['start']:
# not overlapping - put previous to results and use next as previous
result.append(prev)
prev = item
else:
# overlappig - create on range start, end
prev['start'] = min(prev['start'], item['start'])
prev['end'] = max(prev['end'], item['end'])
# add when there is no next item
result.append(prev)
print(pd.DataFrame(result))
Result:
id start end
0 site a 2021-07-08 00:00:00 2021-07-08 12:00:00
2 site a 2021-07-08 17:36:00 2021-07-09 11:40:00
5 site b 2021-07-08 00:00:00 2021-07-08 10:24:00
7 site b 2021-07-08 17:32:00 2021-07-09 13:00:00