Given a dataframe, I need to find, for each row, the sum of the elements in the column 'data' over the rows whose local_time lies between that row's local_time and its end_time. The end_time is always local_time + 30 seconds.
Also, if the aggregated value is greater than or equal to 4, then I need to set the Flag to 1 for all rows whose local_time lies within that [local_time, end_time] window, grouping by the ID column.
Input:
ID local_time end_time data Flag
0 aa 2022-01-02 10:01:06 2022-01-02 10:01:36 0 0
1 aa 2022-01-02 10:01:12 2022-01-02 10:01:42 0 0
2 aa 2022-01-02 10:01:18 2022-01-02 10:01:48 1 0
3 aa 2022-01-02 10:01:24 2022-01-02 10:01:54 1 0
4 aa 2022-01-02 10:01:30 2022-01-02 10:02:00 1 0
5 aa 2022-01-02 10:01:36 2022-01-02 10:02:06 1 0
6 aa 2022-01-02 10:01:42 2022-01-02 10:02:12 1 0
7 aa 2022-01-02 10:01:48 2022-01-02 10:02:18 0 0
8 aa 2022-01-02 10:01:54 2022-01-02 10:02:24 1 0
9 aa 2022-01-02 10:02:00 2022-01-02 10:02:30 0 0
10 aa 2022-01-02 10:02:06 2022-01-02 10:02:36 0 0
Output:
Agg Flag_new
0 4 1
1 5 1
2 5 1
3 5 1
4 4 1
5 3 1
6 3 1
7 2 1
8 2 1
9 1 1
10 1 0
The code below does the job, but it takes too much time because it re-scans the whole dataframe once per row:
aggr_data_sum = []
df['flag'] = 0
for i in range(len(df)):
    id = df['ID'].iloc[i]
    st_time = df['local_time'].iloc[i]
    end_time = df['end_time'].iloc[i]
    temp = df.loc[(df['ID'] == id) & (df['local_time'] >= st_time) & (df['local_time'] <= end_time)]
    select_indices = list(temp.index)
    aggr_sum = sum(temp['data'])
    aggr_data_sum.append(aggr_sum)
    if aggr_sum >= 4:
        df.loc[select_indices, 'flag'] = 1
df['aggr_data_sum'] = aggr_data_sum
df[['aggr_data_sum', 'flag']]
CodePudding user response:
Update
You can use merge to pair each row (keyed by its original index via reset_index) with every row of the same ID, filter out the pairs that fall outside the 30-second window, then groupby + sum on that index to aggregate the data per row. Finally, use join (or concat) to attach the result back to the dataframe:
import pandas as pd

offset = pd.Timedelta(seconds=30)
filter_local_time = lambda x: x['local_time_y'].between(x['local_time_x'], x['local_time_x'] + offset)

df1 = df1.join(df1[['ID', 'local_time']].reset_index().merge(df1, on='ID')
               .loc[filter_local_time].groupby('index')['data'].sum().to_frame('Agg')
               .assign(Flag_new=lambda x: x['Agg'].ge(4).astype(int)))
Output:
>>> df1
ID local_time data Agg Flag_new
0 aa 2022-01-02 10:01:06 0 4 1
1 aa 2022-01-02 10:01:12 0 5 1
2 aa 2022-01-02 10:01:18 1 5 1
3 aa 2022-01-02 10:01:24 1 5 1
4 aa 2022-01-02 10:01:30 1 4 1
5 aa 2022-01-02 10:01:36 1 4 1
6 aa 2022-01-02 10:01:42 1 3 0
7 aa 2022-01-02 10:01:48 0 2 0
8 aa 2022-01-02 10:01:54 1 2 0
9 aa 2022-01-02 10:02:00 0 1 0
10 aa 2022-01-02 10:02:06 1 1 0
Note: for the input dataframe, I only keep the ID, local_time and data columns.
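For reproducibility, here is a minimal sketch (column names as in the question, data taken from the question's input) of how such a reduced df1 can be built before running the snippet above:
import pandas as pd

# Reduced input: only ID, local_time and data, with local_time already parsed as datetimes
df1 = pd.DataFrame({
    'ID': ['aa'] * 11,
    'local_time': pd.date_range('2022-01-02 10:01:06', periods=11, freq='6s'),
    'data': [0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0],
})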
CodePudding user response:
Since we only need to consider the 30 seconds following each local_time, we can use rolling with a time window of '30s'. Pandas' time-based windows look backwards, so the series is reversed with [::-1] before rolling (and reversed back afterwards) to make the window forward-looking. This needs to be done on each group separately (as indicated by the ID column), so we also need an additional groupby. Finally, to get Flag_new, we can use a moving window again and check the maximum Agg value: if it is greater than or equal to 4, the flag is set to 1.
def group_rolling_window(x):
    # Reverse so the 30s window looks forward from each local_time, then reverse back
    x['Agg'] = x['data'][::-1].rolling('30s', closed='both', min_periods=1).sum()[::-1]
    # A row is flagged if any window that covers it (i.e. any anchor in the preceding 30s) reaches 4
    x['Flag_new'] = x['Agg'].rolling('30s', closed='both', min_periods=1).max().ge(4).astype(int)
    return x

df['local_time'] = pd.to_datetime(df['local_time'])
df = df.set_index('local_time').groupby('ID').apply(group_rolling_window).reset_index()
Result:
local_time ID end_time data Flag Agg Flag_new
0 2022-01-02 10:01:06 aa 2022-01-02 10:01:36 0 0 4.0 1
1 2022-01-02 10:01:12 aa 2022-01-02 10:01:42 0 0 5.0 1
2 2022-01-02 10:01:18 aa 2022-01-02 10:01:48 1 0 5.0 1
3 2022-01-02 10:01:24 aa 2022-01-02 10:01:54 1 0 5.0 1
4 2022-01-02 10:01:30 aa 2022-01-02 10:02:00 1 0 4.0 1
5 2022-01-02 10:01:36 aa 2022-01-02 10:02:06 1 0 3.0 1
6 2022-01-02 10:01:42 aa 2022-01-02 10:02:12 1 0 2.0 1
7 2022-01-02 10:01:48 aa 2022-01-02 10:02:18 0 0 1.0 1
8 2022-01-02 10:01:54 aa 2022-01-02 10:02:24 1 0 1.0 1
9 2022-01-02 10:02:00 aa 2022-01-02 10:02:30 0 0 0.0 1
10 2022-01-02 10:02:06 aa 2022-01-02 10:02:36 0 0 0.0 0
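For reference, a minimal usage sketch of this approach on the question's data, reusing group_rolling_window from above. Depending on your pandas version, passing group_keys=False to groupby may be needed so that apply does not prepend ID to the index (the snippet above omits it):
import pandas as pd

# Rebuild the sample input from the question
df = pd.DataFrame({
    'ID': ['aa'] * 11,
    'local_time': pd.date_range('2022-01-02 10:01:06', periods=11, freq='6s'),
    'data': [0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0],
})

# group_keys=False keeps local_time as the only index level when apply returns the groups as-is
out = (df.set_index('local_time')
         .groupby('ID', group_keys=False)
         .apply(group_rolling_window)
         .reset_index())
print(out[['Agg', 'Flag_new']])  # Agg and Flag_new should match the Result shown above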