How can I calculate the mean of the values in one data frame whose datetime falls within the time intervals of another data frame, grouped by name?
df1
import pandas as pd

df1 = (pd.DataFrame({'name': ['a', 'a', 'a', 'a', 'b'],
                     'time_start': ['2000-01-01 00:01:12',
                                    '2000-01-01 00:02:12',
                                    '2000-01-01 00:03:12',
                                    '2000-01-01 00:04:12',
                                    '2000-01-01 00:05:12'],
                     'time_end': ['2000-01-01 00:01:18',
                                  '2000-01-01 00:02:22',
                                  '2000-01-01 00:03:24',
                                  '2000-01-01 00:04:40',
                                  '2000-01-01 00:05:14']})
       .assign(time_start=lambda x: pd.to_datetime(x['time_start']),
               time_end=lambda x: pd.to_datetime(x['time_end'])))
df2
df2 = (pd.DataFrame({'name': ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'b'],
                     'time': ['2000-01-01 00:01:12',
                              '2000-01-01 00:01:13',
                              '2000-01-01 00:01:14',
                              '2000-01-01 00:02:13',
                              '2000-01-01 00:02:20',
                              '2000-01-01 00:03:12',
                              '2000-01-01 00:04:12',
                              '2000-01-01 00:04:30',
                              '2000-01-01 00:05:12'],
                     'values': [10, 20, 30, 40, 50, 60, 70, 80, 90]})
       .assign(time=lambda x: pd.to_datetime(x['time'])))
The output should be:
  name          time_start            time_end  values
0    a 2000-01-01 00:01:12 2000-01-01 00:01:18      20
1    a 2000-01-01 00:02:12 2000-01-01 00:02:22      45
2    a 2000-01-01 00:03:12 2000-01-01 00:03:24      60
3    a 2000-01-01 00:04:12 2000-01-01 00:04:40      75
4    b 2000-01-01 00:05:12 2000-01-01 00:05:14      90
CodePudding user response:
Using a merge in this context is inefficient, as it creates all combinations of rows per group (which can exhaust memory if the dataset is large).
For an efficient solution, use merge_asof:
out = (pd
       .merge_asof(df2, df1, left_on='time', right_on='time_start', by='name')
       .query('time <= time_end')
       .groupby(['name', 'time_start', 'time_end'], as_index=False)['values'].mean()
      )
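Note that merge_asof requires both frames to be sorted by their key (which is the case here for time and time_start). Otherwise, sort the data first. The intervals within each name should also be non-overlapping (again, the case here), because each row of df2 is matched to at most one interval: the one with the latest time_start that is not after time.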
If the data is not initially sorted:
out = (pd
       .merge_asof(df2.sort_values(by='time'),
                   df1.sort_values(by='time_start'),
                   left_on='time', right_on='time_start', by='name')
       .query('time <= time_end')
       .groupby(['name', 'time_start', 'time_end'], as_index=False)['values'].mean()
      )
Output:
  name          time_start            time_end  values
0    a 2000-01-01 00:01:12 2000-01-01 00:01:18    20.0
1    a 2000-01-01 00:02:12 2000-01-01 00:02:22    45.0
2    a 2000-01-01 00:03:12 2000-01-01 00:03:24    60.0
3    a 2000-01-01 00:04:12 2000-01-01 00:04:40    75.0
4    b 2000-01-01 00:05:12 2000-01-01 00:05:14    90.0
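To illustrate why the intervals need to be non-overlapping for this approach, here is a minimal sketch on made-up data (the frames intervals and points are hypothetical and not part of the question). With two overlapping intervals, merge_asof assigns each point only to the interval with the latest time_start, so a point that falls inside both intervals is counted once.

```python
import pandas as pd

# Hypothetical data: two intervals for the same name that overlap
# between 00:00:05 and 00:00:10.
intervals = pd.DataFrame({
    'name': ['a', 'a'],
    'time_start': pd.to_datetime(['2000-01-01 00:00:00', '2000-01-01 00:00:05']),
    'time_end': pd.to_datetime(['2000-01-01 00:00:10', '2000-01-01 00:00:15']),
})
points = pd.DataFrame({
    'name': ['a', 'a'],
    'time': pd.to_datetime(['2000-01-01 00:00:02', '2000-01-01 00:00:07']),
    'values': [1, 3],
})

# Each point is matched to the latest interval start that is <= its time.
matched = pd.merge_asof(points, intervals,
                        left_on='time', right_on='time_start', by='name')

result = (matched
          .query('time <= time_end')
          .groupby(['name', 'time_start', 'time_end'], as_index=False)['values']
          .mean())
print(result)
```

The point at 00:00:07 lies in both intervals but is matched only to the second one, so the first interval reports a mean of 1.0 rather than 2.0. If your intervals can overlap, this approach silently undercounts.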
CodePudding user response:
You can join both DataFrames first, then test for overlap with Series.between and boolean indexing, and finally aggregate with mean.
This solution also works for overlapping intervals.
df = df1.merge(df2, on='name')
df = (df[df['time'].between(df['time_start'], df['time_end'])]
      .groupby(['name', 'time_start', 'time_end'], as_index=False)['values']
      .mean())
print(df)
  name          time_start            time_end  values
0    a 2000-01-01 00:01:12 2000-01-01 00:01:18    20.0
1    a 2000-01-01 00:02:12 2000-01-01 00:02:22    45.0
2    a 2000-01-01 00:03:12 2000-01-01 00:03:24    60.0
3    a 2000-01-01 00:04:12 2000-01-01 00:04:40    75.0
4    b 2000-01-01 00:05:12 2000-01-01 00:05:14    90.0
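For a rough sense of the trade-off between the two approaches, the sketch below rebuilds the sample data from the question and compares the size of the intermediate frames. The names joined, asof, by_join and by_asof are only illustrative. The plain merge produces one row per (interval, point) pair within each name, while merge_asof keeps one row per point.

```python
import pandas as pd

# Sample data from the question, built directly with datetime columns.
df1 = pd.DataFrame({
    'name': ['a', 'a', 'a', 'a', 'b'],
    'time_start': pd.to_datetime(['2000-01-01 00:01:12', '2000-01-01 00:02:12',
                                  '2000-01-01 00:03:12', '2000-01-01 00:04:12',
                                  '2000-01-01 00:05:12']),
    'time_end': pd.to_datetime(['2000-01-01 00:01:18', '2000-01-01 00:02:22',
                                '2000-01-01 00:03:24', '2000-01-01 00:04:40',
                                '2000-01-01 00:05:14']),
})
df2 = pd.DataFrame({
    'name': ['a'] * 8 + ['b'],
    'time': pd.to_datetime(['2000-01-01 00:01:12', '2000-01-01 00:01:13',
                            '2000-01-01 00:01:14', '2000-01-01 00:02:13',
                            '2000-01-01 00:02:20', '2000-01-01 00:03:12',
                            '2000-01-01 00:04:12', '2000-01-01 00:04:30',
                            '2000-01-01 00:05:12']),
    'values': [10, 20, 30, 40, 50, 60, 70, 80, 90],
})

keys = ['name', 'time_start', 'time_end']

# Plain merge: 4 intervals x 8 points for 'a', plus 1 x 1 for 'b'.
joined = df1.merge(df2, on='name')
by_join = (joined[joined['time'].between(joined['time_start'], joined['time_end'])]
           .groupby(keys, as_index=False)['values'].mean())

# merge_asof: one row per point in df2.
asof = pd.merge_asof(df2, df1, left_on='time', right_on='time_start', by='name')
by_asof = (asof.query('time <= time_end')
           .groupby(keys, as_index=False)['values'].mean())

print(len(joined), len(asof))  # 33 9
print(by_join['values'].tolist() == by_asof['values'].tolist())  # True
```

On this sample both approaches give the same means. The intermediate join grows with the product of rows per name, so the plain merge is the simpler choice when intervals may overlap and the data is small, while merge_asof scales better when intervals are non-overlapping.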