Calculate mean by overlapping time intervals in one data frame with one datetime column in another data frame

How can I calculate the mean of the values in one data frame (df2) whose timestamps fall within the time intervals of another data frame (df1), grouped by name?

df1

import pandas as pd

df1 = (pd.DataFrame({'name': ['a', 'a', 'a', 'a', 'b'],
              'time_start': ['2000-01-01 00:01:12',
                            '2000-01-01 00:02:12',
                            '2000-01-01 00:03:12',
                            '2000-01-01 00:04:12',
                            '2000-01-01 00:05:12'],
              'time_end': ['2000-01-01 00:01:18',
                            '2000-01-01 00:02:22',
                            '2000-01-01 00:03:24',
                            '2000-01-01 00:04:40',
                            '2000-01-01 00:05:14']})
 .assign(time_start = lambda x: pd.to_datetime(x['time_start']),
        time_end = lambda x: pd.to_datetime(x['time_end'])))

df2

df2 = (pd.DataFrame({'name': ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'b'],
              'time': ['2000-01-01 00:01:12',
                            '2000-01-01 00:01:13',
                            '2000-01-01 00:01:14',
                            '2000-01-01 00:02:13',
                            '2000-01-01 00:02:20',
                            '2000-01-01 00:03:12',
                            '2000-01-01 00:04:12',
                            '2000-01-01 00:04:30',
                            '2000-01-01 00:05:12'],
              'values': [10, 20, 30, 40, 50, 60, 70, 80, 90]})
 .assign(time = lambda x: pd.to_datetime(x['time'])))

The expected output is:

  name          time_start            time_end  values
0    a 2000-01-01 00:01:12 2000-01-01 00:01:18      20
1    a 2000-01-01 00:02:12 2000-01-01 00:02:22      45
2    a 2000-01-01 00:03:12 2000-01-01 00:03:24      60
3    a 2000-01-01 00:04:12 2000-01-01 00:04:40      75
4    b 2000-01-01 00:05:12 2000-01-01 00:05:14      90

CodePudding user response:

A plain merge on name is inefficient here because it creates every combination of df1 and df2 rows within each group, which can exhaust memory on a large dataset.
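To make the cost concrete, here is a minimal sketch that counts the rows a plain merge on name produces. The frames are compact stand-ins with the same group sizes as the question's df1 and df2 (the interval_id column and the names sizes1, sizes2, and expected are only for illustration):

```python
import pandas as pd

# Stand-ins for the question's frames: same number of rows per name.
df1 = pd.DataFrame({'name': ['a'] * 4 + ['b'], 'interval_id': range(5)})
df2 = pd.DataFrame({'name': ['a'] * 8 + ['b'], 'values': range(10, 100, 10)})

# A merge on 'name' pairs every df1 row with every df2 row of the same name.
merged = df1.merge(df2, on='name')

# Expected size: sum over names of (rows in df1) * (rows in df2).
sizes1 = df1.groupby('name').size()
sizes2 = df2.groupby('name').size()
expected = int((sizes1 * sizes2).sum())

print(len(merged), expected)  # 33 33
```

With 5 intervals and 9 readings the merge already holds 33 rows; the intermediate result grows with the product of the group sizes, not their sum.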

For an efficient solution, use merge_asof:

out = (pd
   .merge_asof(df2, df1, left_on='time', right_on='time_start', by='name')
   .query('time <= time_end')
   .groupby(['name', 'time_start', 'time_end'], as_index=False)['values'].mean()
)

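Note that merge_asof requires both frames to be sorted by their merge key (time in df2 and time_start in df1), which is the case here; otherwise, sort them first. The intervals within each name should also be non-overlapping (again the case here), because merge_asof matches each time only to the most recent time_start at or before it.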

If the data is not initially sorted:

out = (pd
   .merge_asof(df2.sort_values(by='time'),
               df1.sort_values(by='time_start'),
               left_on='time', right_on='time_start', by='name')
   .query('time <= time_end')
   .groupby(['name', 'time_start', 'time_end'], as_index=False)['values'].mean()
)

Output:

  name          time_start            time_end  values
0    a 2000-01-01 00:01:12 2000-01-01 00:01:18    20.0
1    a 2000-01-01 00:02:12 2000-01-01 00:02:22    45.0
2    a 2000-01-01 00:03:12 2000-01-01 00:03:24    60.0
3    a 2000-01-01 00:04:12 2000-01-01 00:04:40    75.0
4    b 2000-01-01 00:05:12 2000-01-01 00:05:14    90.0
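Both preconditions can be checked before calling merge_asof. The following is a sketch under the assumption that intervals are closed on both ends, as the time <= time_end filter implies; intervals_overlap is a hypothetical helper, not a pandas function, and the frame is a shortened copy of the question's df1:

```python
import pandas as pd

def intervals_overlap(df, by='name', start='time_start', end='time_end'):
    """Return True if any two intervals within the same group overlap."""
    s = df.sort_values([by, start])
    # End of the previous interval in the same group (NaT for the first one).
    prev_end = s.groupby(by)[end].shift()
    # Sorted by start, any overlap shows up between neighbours:
    # an interval overlaps if it starts at or before the previous end.
    return bool((s[start] <= prev_end).any())

df1 = pd.DataFrame({
    'name': ['a', 'a', 'b'],
    'time_start': pd.to_datetime(['2000-01-01 00:01:12',
                                  '2000-01-01 00:02:12',
                                  '2000-01-01 00:05:12']),
    'time_end': pd.to_datetime(['2000-01-01 00:01:18',
                                '2000-01-01 00:02:22',
                                '2000-01-01 00:05:14']),
})

print(df1['time_start'].is_monotonic_increasing)  # True
print(intervals_overlap(df1))                     # False
```

If the first check fails, sort the frame; if the second returns True, merge_asof will silently assign each reading to only one of the overlapping intervals.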

CodePudding user response:

You can merge both DataFrames on name first, then use Series.between with boolean indexing to keep the rows whose time falls inside the interval, and finally aggregate with mean. This solution also works for overlapping intervals:

df = df1.merge(df2, on='name')
df = (df[df['time'].between(df['time_start'], df['time_end'])]
          .groupby(['name','time_start','time_end'], as_index=False)['values']
          .mean())
print(df)
  name          time_start            time_end  values
0    a 2000-01-01 00:01:12 2000-01-01 00:01:18    20.0
1    a 2000-01-01 00:02:12 2000-01-01 00:02:22    45.0
2    a 2000-01-01 00:03:12 2000-01-01 00:03:24    60.0
3    a 2000-01-01 00:04:12 2000-01-01 00:04:40    75.0
4    b 2000-01-01 00:05:12 2000-01-01 00:05:14    90.0
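To illustrate the difference on overlapping intervals, here is a small sketch with made-up data (the frames intervals and points and their values are hypothetical, not from the question). Two intervals for name a overlap between 00:00:20 and 00:00:30, and one reading falls inside that overlap:

```python
import pandas as pd

intervals = pd.DataFrame({
    'name': ['a', 'a'],
    'time_start': pd.to_datetime(['2000-01-01 00:00:10', '2000-01-01 00:00:20']),
    'time_end': pd.to_datetime(['2000-01-01 00:00:30', '2000-01-01 00:00:40']),
})
points = pd.DataFrame({
    'name': ['a', 'a', 'a'],
    'time': pd.to_datetime(['2000-01-01 00:00:15',
                            '2000-01-01 00:00:25',
                            '2000-01-01 00:00:35']),
    'values': [10, 20, 30],
})
keys = ['name', 'time_start', 'time_end']

# merge + between: the 00:00:25 reading is counted in both intervals.
m = intervals.merge(points, on='name')
by_between = (m[m['time'].between(m['time_start'], m['time_end'])]
              .groupby(keys, as_index=False)['values'].mean())

# merge_asof: each reading is matched only to the latest interval
# that starts at or before it, so 00:00:25 goes to the second interval.
by_asof = (pd.merge_asof(points, intervals,
                         left_on='time', right_on='time_start', by='name')
           .query('time <= time_end')
           .groupby(keys, as_index=False)['values'].mean())

print(by_between['values'].tolist())  # [15.0, 25.0]
print(by_asof['values'].tolist())     # [10.0, 25.0]
```

The merge-based version counts the shared reading in both intervals, at the price of the larger intermediate frame.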