Aggregate a column element based on the time limit provided in pandas


Given a dataframe, I need to find the sum of the elements in the 'data' column over the rows whose local_time lies between the current row's local_time and end_time. The end_time is always local_time + 30 seconds.

Also, if the aggregated value is greater than or equal to 4, I need to set the Flag to 1 for every row in that window (local_time <= row's local_time <= end_time), grouping by the ID column.


    ID            local_time              end_time  data  Flag
0   aa  2022-01-02  10:01:06  2022-01-02  10:01:36     0     0
1   aa  2022-01-02  10:01:12  2022-01-02  10:01:42     0     0
2   aa  2022-01-02  10:01:18  2022-01-02  10:01:48     1     0
3   aa  2022-01-02  10:01:24  2022-01-02  10:01:54     1     0
4   aa  2022-01-02  10:01:30  2022-01-02  10:02:00     1     0
5   aa  2022-01-02  10:01:36  2022-01-02  10:02:06     1     0
6   aa  2022-01-02  10:01:42  2022-01-02  10:02:12     1     0
7   aa  2022-01-02  10:01:48  2022-01-02  10:02:18     0     0
8   aa  2022-01-02  10:01:54  2022-01-02  10:02:24     1     0
9   aa  2022-01-02  10:02:00  2022-01-02  10:02:30     0     0
10  aa  2022-01-02  10:02:06  2022-01-02  10:02:36     0     0
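
To make the rule concrete, here is a minimal sketch that rebuilds the sample frame above and computes the window sum for the first row only. The names `first`, `in_window` and `window_sum` are illustrative and not part of the original question.

```python
import pandas as pd

# Sample frame mirroring the table above: 11 rows for ID 'aa', one every 6 seconds.
df = pd.DataFrame({
    "ID": ["aa"] * 11,
    "local_time": pd.date_range("2022-01-02 10:01:06", periods=11,
                                freq=pd.Timedelta(seconds=6)),
    "data": [0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0],
})
df["end_time"] = df["local_time"] + pd.Timedelta(seconds=30)
df["Flag"] = 0

# Window of the first row: same ID, local_time in [10:01:06, 10:01:36] (both ends inclusive).
first = df.iloc[0]
in_window = (df["ID"] == first["ID"]) & df["local_time"].between(
    first["local_time"], first["end_time"]
)
window_sum = int(df.loc[in_window, "data"].sum())
print(window_sum)  # rows 0..5 -> 0 + 0 + 1 + 1 + 1 + 1 = 4
```

Since the sum reaches 4, rows 0 to 5 would all get their flag set to 1.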


Expected output:

    Agg  Flag_new
0     4         1
1     5         1
2     5         1
3     5         1
4     4         1
5     3         1
6     3         1
7     2         1
8     2         1
9     1         1
10    1         0

The code below does the job, but it takes too much time:

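aggr_data_sum = []
df['flag'] = 0

for i in range(len(df)):
    id = df['ID'].iloc[i]
    st_time = df['local_time'].iloc[i]
    end_time = df['end_time'].iloc[i]
    # Rows of the same ID whose local_time falls in [st_time, end_time]
    temp = df.loc[(df['ID'] == id) & (df['local_time'] >= st_time) & (df['local_time'] <= end_time)]
    aggr_sum = temp['data'].sum()
    aggr_data_sum.append(aggr_sum)
    select_indices = list(temp.index)

    if aggr_sum >= 4:
        # Flag every row inside the window
        df.loc[select_indices, 'flag'] = 1

df['aggr_data_sum'] = aggr_data_sum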


CodePudding user response:


You can use merge to pair the rows of each ID, filter the pairs according to your condition, then groupby and sum to aggregate the data. Finally, use join (or concat) to attach the result to the original dataframe:

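import pandas as pd

offset = pd.Timedelta(seconds=30)
# True where the paired row's time lies in [local_time, local_time + 30s]
filter_local_time = lambda x: x['local_time_y'].between(x['local_time_x'], x['local_time_x'] + offset)

# Self-merge on ID, keep pairs inside the window, sum 'data' per original row
df1 = df1.join(df1[['ID', 'local_time']].reset_index().merge(df1, on='ID')
                  .loc[filter_local_time].groupby('index')['data'].sum().to_frame('Agg')
                  .assign(Flag_new=lambda x: x['Agg'].ge(4).astype(int)))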


>>> df1
    ID          local_time  data  Agg  Flag_new
0   aa 2022-01-02 10:01:06     0    4         1
1   aa 2022-01-02 10:01:12     0    5         1
2   aa 2022-01-02 10:01:18     1    5         1
3   aa 2022-01-02 10:01:24     1    5         1
4   aa 2022-01-02 10:01:30     1    4         1
5   aa 2022-01-02 10:01:36     1    4         1
6   aa 2022-01-02 10:01:42     1    3         0
7   aa 2022-01-02 10:01:48     0    2         0
8   aa 2022-01-02 10:01:54     1    2         0
9   aa 2022-01-02 10:02:00     0    1         0
10  aa 2022-01-02 10:02:06     1    1         0

Note: for the input dataframe, I only keep ID, local_time and data columns.
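
For readers who want to follow the chained expression one step at a time, here is a sketch that unrolls it on the 11-row sample shown in the output above. The intermediate names `pairs`, `in_window`, `agg` and `result` are illustrative assumptions, not part of the original answer. Note that the self-merge builds every pair of rows per ID, so memory grows quadratically with group size.

```python
import pandas as pd

# Input as shown in the output above (only ID, local_time and data are kept).
df1 = pd.DataFrame({
    "ID": ["aa"] * 11,
    "local_time": pd.date_range("2022-01-02 10:01:06", periods=11,
                                freq=pd.Timedelta(seconds=6)),
    "data": [0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1],
})
offset = pd.Timedelta(seconds=30)

# 1. Pair every row with every row of the same ID; 'index' remembers the left row.
pairs = df1[["ID", "local_time"]].reset_index().merge(df1, on="ID")

# 2. Keep pairs whose right-hand time lies in [local_time_x, local_time_x + 30s].
in_window = pairs["local_time_y"].between(
    pairs["local_time_x"], pairs["local_time_x"] + offset
)

# 3. Sum 'data' per original row.
agg = pairs.loc[in_window].groupby("index")["data"].sum().rename("Agg")

# 4. Attach the sums to the original frame and derive the per-row flag.
result = df1.join(agg)
result["Flag_new"] = result["Agg"].ge(4).astype(int)
print(result)
```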

CodePudding user response:

Since we only need to consider the 30 seconds following each local_time, we can use rolling with a time window of 30s. Because rolling looks backwards in time, the series is reversed before the rolling sum and reversed back afterwards. This needs to be done on each group separately (as indicated by the ID column), so we also need an additional groupby. Finally, to get Flag_new, we can use a moving window again and check the maximum Agg value: if it is equal to or above 4, we set the flag to 1.

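def group_rolling_window(x):
    # Reverse so the 30s window covers [t, t + 30s], then restore the order
    x['Agg'] = x['data'][::-1].rolling('30s', closed='both', min_periods=1).sum()[::-1]
    # A row is flagged if any row in the preceding 30s has Agg >= 4
    x['Flag_new'] = x['Agg'].rolling('30s', closed='both', min_periods=1).max().ge(4).astype(int)
    return x

# Time-based rolling needs a datetime index
df['local_time'] = pd.to_datetime(df['local_time'])
df = df.set_index('local_time').groupby('ID').apply(group_rolling_window).reset_index()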


            local_time  ID              end_time  data  Flag   Agg  Flag_new
0  2022-01-02 10:01:06  aa  2022-01-02  10:01:36     0      0  4.0         1
1  2022-01-02 10:01:12  aa  2022-01-02  10:01:42     0      0  5.0         1
2  2022-01-02 10:01:18  aa  2022-01-02  10:01:48     1      0  5.0         1
3  2022-01-02 10:01:24  aa  2022-01-02  10:01:54     1      0  5.0         1
4  2022-01-02 10:01:30  aa  2022-01-02  10:02:00     1      0  4.0         1
5  2022-01-02 10:01:36  aa  2022-01-02  10:02:06     1      0  3.0         1
6  2022-01-02 10:01:42  aa  2022-01-02  10:02:12     1      0  2.0         1
7  2022-01-02 10:01:48  aa  2022-01-02  10:02:18     0      0  1.0         1
8  2022-01-02 10:01:54  aa  2022-01-02  10:02:24     1      0  1.0         1
9  2022-01-02 10:02:00  aa  2022-01-02  10:02:30     0      0  0.0         1
10 2022-01-02 10:02:06  aa  2022-01-02  10:02:36     0      0  0.0         0
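
As a cross-check that does not depend on rolling over a reversed index, the same forward-looking sum can be computed with a cumulative sum and `np.searchsorted`. This is a swapped-in technique, not the answer's method; the helper `forward_window_sum` is hypothetical and assumes each ID group is sorted by local_time.

```python
import numpy as np
import pandas as pd

def forward_window_sum(times, values, window):
    """Sum of values over [t, t + window] for each t (times sorted ascending)."""
    t = times.to_numpy()
    csum = np.concatenate(([0], np.cumsum(values.to_numpy())))
    left = np.searchsorted(t, t, side="left")  # first row at time t
    right = np.searchsorted(t, t + window.to_timedelta64(), side="right")  # one past the last row <= t + window
    return csum[right] - csum[left]

df = pd.DataFrame({
    "ID": ["aa"] * 11,
    "local_time": pd.date_range("2022-01-02 10:01:06", periods=11,
                                freq=pd.Timedelta(seconds=6)),
    "data": [0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0],
}).sort_values(["ID", "local_time"])

parts = []
for _, g in df.groupby("ID", sort=False):
    g = g.copy()
    g["Agg"] = forward_window_sum(g["local_time"], g["data"], pd.Timedelta(seconds=30))
    # Flag a row if any row in the preceding 30s (inclusive) has Agg >= 4.
    hit = g.set_index("local_time")["Agg"].ge(4).astype(int)
    g["Flag_new"] = hit.rolling("30s", closed="both").max().astype(int).to_numpy()
    parts.append(g)

out = pd.concat(parts)
print(out)
```

On the sample data this reproduces the Agg and Flag_new columns of the table above, with Agg as integers rather than floats.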