I have the following DataFrame (df):
timestamp id val
0 2022-01-01 00:37:34 1.0 0.128464
1 2022-01-01 00:52:15 1.0 0.823504
2 2022-01-01 02:00:01 1.0 0.807617
3 2022-01-01 02:37:14 1.0 0.154851
4 2022-01-01 04:44:46 1.0 0.049817
5 2022-01-01 00:03:06 2.0 0.538565
6 2022-01-01 00:04:05 2.0 0.332919
7 2022-01-01 00:04:24 2.0 0.106591
8 2022-01-01 00:05:41 2.0 0.552562
9 2022-01-01 00:05:58 2.0 0.851130
10 2022-01-01 00:06:58 2.0 0.400711
11 2022-01-01 00:08:43 2.0 0.840532
12 2022-01-01 00:08:52 2.0 0.184425
13 2022-01-01 00:12:52 2.0 0.956525
14 2022-01-01 00:15:52 2.0 0.403509
I'm attempting to bin the values of every row that falls within a 5-minute interval (taking the mean), and to add missing 5-minute rows with val 0 where the gap between consecutive rows exceeds 5 minutes, as follows:
timestamp id val
0 2022-01-01 00:37:34 1.0 0.128464 \_ val mean
1 2022-01-01 00:52:15 1.0 0.823504 /
-------------------------------------
Add missing 5min intervals with val 0
from 2022-01-01 00:52:15
to 2022-01-01 02:00:01
-------------------------------------
2 2022-01-01 02:00:01 1.0 0.807617 - val
-------------------------------------
Add missing 5min intervals with val 0
from 2022-01-01 02:00:01
to 2022-01-01 02:37:14
-------------------------------------
3 2022-01-01 02:37:14 1.0 0.154851 - val
-------------------------------------
Add missing 5min intervals with val 0
from 2022-01-01 02:37:14
to 2022-01-01 04:44:46
-------------------------------------
4 2022-01-01 04:44:46 1.0 0.049817 - val
New Group
-------------------------------------
5 2022-01-01 00:03:06 2.0 0.538565 \
6 2022-01-01 00:04:05 2.0 0.332919 |
7 2022-01-01 00:04:24 2.0 0.106591 |_ val mean
8 2022-01-01 00:05:41 2.0 0.552562 |
9 2022-01-01 00:05:58 2.0 0.851130 |
10 2022-01-01 00:06:58 2.0 0.400711 /
-------------------------------------
11 2022-01-01 00:08:43 2.0 0.840532 \
12 2022-01-01 00:08:52 2.0 0.184425 |_ val mean
13 2022-01-01 00:12:52 2.0 0.956525 /
-------------------------------------
14 2022-01-01 00:15:52 2.0 0.403509 - val
As such, the resulting frame would consist of 5-minute mean values of val, plus rows with val 0 where no activity occurred. I tried using pd.Grouper(key="timestamp", freq='5min', origin='start')
to obtain the 5-minute intervals, but I wasn't sure how to proceed from there.
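For reference, this is roughly how far I got (a minimal sketch; it assumes pandas is imported as pd, the data above is in a DataFrame df, and timestamp is already a datetime column; the g5/binned names are only for illustration):
import pandas as pd

# the 5-minute grouper I tried; I couldn't see how to also split groups when the
# gap between consecutive rows exceeds 5 minutes, or how to fill those gaps with 0
g5 = pd.Grouper(key="timestamp", freq="5min", origin="start")
binned = df.groupby(["id", g5])["val"].mean()
print(binned)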
Any help would be appreciated.
CodePudding user response:
Is this what you are looking for?
def process(sdf):
    # resample each id's rows into 5-minute bins anchored at the group's first
    # timestamp, average per bin, and fill the empty bins with the id and val 0
    return (sdf.resample("5min", on="timestamp", origin=sdf.timestamp.iat[0])
               .mean().fillna({"id": sdf.name, "val": 0}))

df = (df.groupby("id", as_index=False).apply(process)
        .droplevel(level=0, axis=0).reset_index(drop=False))
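Here origin=sdf.timestamp.iat[0] anchors each id's 5-minute bins at that group's first timestamp (which is why the bins for id 1 fall on 00:37:34, 00:42:34, ...), and sdf.name inside groupby(...).apply is the current group's id, used to fill the id column of the inserted zero rows.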
Result:
timestamp id val
0 2022-01-01 00:37:34 1.0 0.128464
1 2022-01-01 00:42:34 1.0 0.000000
2 2022-01-01 00:47:34 1.0 0.823504
3 2022-01-01 00:52:34 1.0 0.000000
4 2022-01-01 00:57:34 1.0 0.000000
5 2022-01-01 01:02:34 1.0 0.000000
6 2022-01-01 01:07:34 1.0 0.000000
7 2022-01-01 01:12:34 1.0 0.000000
8 2022-01-01 01:17:34 1.0 0.000000
9 2022-01-01 01:22:34 1.0 0.000000
10 2022-01-01 01:27:34 1.0 0.000000
11 2022-01-01 01:32:34 1.0 0.000000
12 2022-01-01 01:37:34 1.0 0.000000
13 2022-01-01 01:42:34 1.0 0.000000
14 2022-01-01 01:47:34 1.0 0.000000
15 2022-01-01 01:52:34 1.0 0.000000
16 2022-01-01 01:57:34 1.0 0.807617
17 2022-01-01 02:02:34 1.0 0.000000
18 2022-01-01 02:07:34 1.0 0.000000
19 2022-01-01 02:12:34 1.0 0.000000
20 2022-01-01 02:17:34 1.0 0.000000
21 2022-01-01 02:22:34 1.0 0.000000
22 2022-01-01 02:27:34 1.0 0.000000
23 2022-01-01 02:32:34 1.0 0.154851
24 2022-01-01 02:37:34 1.0 0.000000
25 2022-01-01 02:42:34 1.0 0.000000
26 2022-01-01 02:47:34 1.0 0.000000
27 2022-01-01 02:52:34 1.0 0.000000
28 2022-01-01 02:57:34 1.0 0.000000
29 2022-01-01 03:02:34 1.0 0.000000
30 2022-01-01 03:07:34 1.0 0.000000
31 2022-01-01 03:12:34 1.0 0.000000
32 2022-01-01 03:17:34 1.0 0.000000
33 2022-01-01 03:22:34 1.0 0.000000
34 2022-01-01 03:27:34 1.0 0.000000
35 2022-01-01 03:32:34 1.0 0.000000
36 2022-01-01 03:37:34 1.0 0.000000
37 2022-01-01 03:42:34 1.0 0.000000
38 2022-01-01 03:47:34 1.0 0.000000
39 2022-01-01 03:52:34 1.0 0.000000
40 2022-01-01 03:57:34 1.0 0.000000
41 2022-01-01 04:02:34 1.0 0.000000
42 2022-01-01 04:07:34 1.0 0.000000
43 2022-01-01 04:12:34 1.0 0.000000
44 2022-01-01 04:17:34 1.0 0.000000
45 2022-01-01 04:22:34 1.0 0.000000
46 2022-01-01 04:27:34 1.0 0.000000
47 2022-01-01 04:32:34 1.0 0.000000
48 2022-01-01 04:37:34 1.0 0.000000
49 2022-01-01 04:42:34 1.0 0.049817
50 2022-01-01 00:03:06 2.0 0.463746
51 2022-01-01 00:08:06 2.0 0.660494
52 2022-01-01 00:13:06 2.0 0.403509
But I don't understand this requirement:
timestamp id val
0 2022-01-01 00:37:34 1.0 0.128464 \_ val mean
1 2022-01-01 00:52:15 1.0 0.823504 /
Those two timestamps are not within a single 5-minute interval of each other, are they?
CodePudding user response:
IIUC, you need two different grouping methods: one to split whenever the difference between consecutive rows of the same id exceeds 5 minutes, and one to generate 5-minute chunks from the start of each group:
# ensure datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])
# split if difference exceeds 5 min
group = df.groupby('id')['timestamp'].diff().gt('5min').cumsum()
# group by 5 min chunks
g = pd.Grouper(key="timestamp", freq='5min', origin='start')
# aggregate
(df
 .groupby(['id', group, g], as_index=False)
 .agg({'timestamp': 'first', 'val': 'mean'})
)
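To see what the gap-based split produces on the sample data, you can inspect the intermediate group Series (sketch; the values shown were computed from the example frame in the question):
# a new label starts whenever the time since the previous row of the same id
# exceeds 5 minutes
print(group.tolist())
# [0, 1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
# every id-1 row gets its own label (all gaps > 5min), while all id-2 rows share
# one label (all gaps <= 5min); combining this with 'id' and the 5-min Grouper
# yields the blocks in the output below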
Output:
id timestamp val
0 1.0 2022-01-01 00:37:34 0.128464
1 1.0 2022-01-01 00:52:15 0.823504
2 1.0 2022-01-01 02:00:01 0.807617
3 1.0 2022-01-01 02:37:14 0.154851
4 1.0 2022-01-01 04:44:46 0.049817
5 2.0 2022-01-01 00:03:06 0.463746
6 2.0 2022-01-01 00:08:43 0.660494
7 2.0 2022-01-01 00:15:52 0.403509
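For instance, 0.463746 for id 2 is the mean of val over rows 5-10 (the first 5-minute chunk), 0.660494 is the mean over rows 11-13, and the last row at 00:15:52 keeps its own val, matching the blocks sketched in the question.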