Pandas - Bin rows based on time interval-CodePudding

Having the following DF:

             timestamp   id       val
0  2022-01-01 00:37:34  1.0  0.128464
1  2022-01-01 00:52:15  1.0  0.823504
2  2022-01-01 02:00:01  1.0  0.807617
3  2022-01-01 02:37:14  1.0  0.154851
4  2022-01-01 04:44:46  1.0  0.049817
5  2022-01-01 00:03:06  2.0  0.538565
6  2022-01-01 00:04:05  2.0  0.332919
7  2022-01-01 00:04:24  2.0  0.106591
8  2022-01-01 00:05:41  2.0  0.552562
9  2022-01-01 00:05:58  2.0  0.851130
10 2022-01-01 00:06:58  2.0  0.400711
11 2022-01-01 00:08:43  2.0  0.840532
12 2022-01-01 00:08:52  2.0  0.184425
13 2022-01-01 00:12:52  2.0  0.956525
14 2022-01-01 00:15:52  2.0  0.403509

I'm attempting to bin the values for every row within a 5min interval plus add missing rows where the interval exceeds 5min, as follows:

             timestamp   id       val
0  2022-01-01 00:37:34  1.0  0.128464 \_ val mean
1  2022-01-01 00:52:15  1.0  0.823504 /
-------------------------------------
Add missing 5min intervals with val 0
from 2022-01-01 00:52:15
to 2022-01-01 02:00:01
-------------------------------------
2  2022-01-01 02:00:01  1.0  0.807617 - val
-------------------------------------
Add missing 5min intervals with val 0
from 2022-01-01 02:00:01
to 2022-01-01 02:37:14
-------------------------------------
3  2022-01-01 02:37:14  1.0  0.154851 - val
-------------------------------------
Add missing 5min intervals with val 0
from 2022-01-01 02:37:14
to 2022-01-01 04:44:46
-------------------------------------
4  2022-01-01 04:44:46  1.0  0.049817 - val

              New Group
-------------------------------------
5  2022-01-01 00:03:06  2.0  0.538565 \
6  2022-01-01 00:04:05  2.0  0.332919 |
7  2022-01-01 00:04:24  2.0  0.106591 |_ val mean
8  2022-01-01 00:05:41  2.0  0.552562 |
9  2022-01-01 00:05:58  2.0  0.851130 |
10 2022-01-01 00:06:58  2.0  0.400711 /
-------------------------------------
11 2022-01-01 00:08:43  2.0  0.840532 \
12 2022-01-01 00:08:52  2.0  0.184425 |_ val mean
13 2022-01-01 00:12:52  2.0  0.956525 /
-------------------------------------
14 2022-01-01 00:15:52  2.0  0.403509 - val

As such, the resulting frame would consist of 5min mean values of val, and rows with 0 val where no activity occurred. I tried using pd.Grouper(key="timestamp", freq='5min', origin='start') to obtain the 5min intervals, but I wasn't sure where to proceed next.

any help would be appreciated.

CodePudding user response：

Does this what you are looking for:

def process(sdf):
    return (sdf.resample("5min", on="timestamp", origin=sdf.timestamp.iat[0])
               .mean().fillna({"id": sdf.name, "val": 0}))

df = (df.groupby("id", as_index=False).apply(process)
        .droplevel(level=0, axis=0).reset_index(drop=False))

Result:

             timestamp   id       val
0  2022-01-01 00:37:34  1.0  0.128464
1  2022-01-01 00:42:34  1.0  0.000000
2  2022-01-01 00:47:34  1.0  0.823504
3  2022-01-01 00:52:34  1.0  0.000000
4  2022-01-01 00:57:34  1.0  0.000000
5  2022-01-01 01:02:34  1.0  0.000000
6  2022-01-01 01:07:34  1.0  0.000000
7  2022-01-01 01:12:34  1.0  0.000000
8  2022-01-01 01:17:34  1.0  0.000000
9  2022-01-01 01:22:34  1.0  0.000000
10 2022-01-01 01:27:34  1.0  0.000000
11 2022-01-01 01:32:34  1.0  0.000000
12 2022-01-01 01:37:34  1.0  0.000000
13 2022-01-01 01:42:34  1.0  0.000000
14 2022-01-01 01:47:34  1.0  0.000000
15 2022-01-01 01:52:34  1.0  0.000000
16 2022-01-01 01:57:34  1.0  0.807617
17 2022-01-01 02:02:34  1.0  0.000000
18 2022-01-01 02:07:34  1.0  0.000000
19 2022-01-01 02:12:34  1.0  0.000000
20 2022-01-01 02:17:34  1.0  0.000000
21 2022-01-01 02:22:34  1.0  0.000000
22 2022-01-01 02:27:34  1.0  0.000000
23 2022-01-01 02:32:34  1.0  0.154851
24 2022-01-01 02:37:34  1.0  0.000000
25 2022-01-01 02:42:34  1.0  0.000000
26 2022-01-01 02:47:34  1.0  0.000000
27 2022-01-01 02:52:34  1.0  0.000000
28 2022-01-01 02:57:34  1.0  0.000000
29 2022-01-01 03:02:34  1.0  0.000000
30 2022-01-01 03:07:34  1.0  0.000000
31 2022-01-01 03:12:34  1.0  0.000000
32 2022-01-01 03:17:34  1.0  0.000000
33 2022-01-01 03:22:34  1.0  0.000000
34 2022-01-01 03:27:34  1.0  0.000000
35 2022-01-01 03:32:34  1.0  0.000000
36 2022-01-01 03:37:34  1.0  0.000000
37 2022-01-01 03:42:34  1.0  0.000000
38 2022-01-01 03:47:34  1.0  0.000000
39 2022-01-01 03:52:34  1.0  0.000000
40 2022-01-01 03:57:34  1.0  0.000000
41 2022-01-01 04:02:34  1.0  0.000000
42 2022-01-01 04:07:34  1.0  0.000000
43 2022-01-01 04:12:34  1.0  0.000000
44 2022-01-01 04:17:34  1.0  0.000000
45 2022-01-01 04:22:34  1.0  0.000000
46 2022-01-01 04:27:34  1.0  0.000000
47 2022-01-01 04:32:34  1.0  0.000000
48 2022-01-01 04:37:34  1.0  0.000000
49 2022-01-01 04:42:34  1.0  0.049817
50 2022-01-01 00:03:06  2.0  0.463746
51 2022-01-01 00:08:06  2.0  0.660494
52 2022-01-01 00:13:06  2.0  0.403509

But I don't understand this requirement:

             timestamp   id       val
0  2022-01-01 00:37:34  1.0  0.128464 \_ val mean
1  2022-01-01 00:52:15  1.0  0.823504 /

Those 2 timestamps are not within a 5 minutes interval?

CodePudding user response：

IIUC, you need 2 different grouping methods. One to split based on a consecutive difference exceeding 5min, and one to generate chunks of 5min from the start of the group:

# ensure datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])

# split if difference exceeds 5 min
group = df.groupby('id')['timestamp'].diff().gt('5min').cumsum()

# group by 5 min chunks
g = pd.Grouper(key="timestamp", freq='5min', origin='start')

# aggregate
(df
 .groupby(['id', group, g], as_index=False)
 .agg({'timestamp': 'first', 'val': 'mean'})
)

Output:

    id           timestamp       val
0  1.0 2022-01-01 00:37:34  0.128464
1  1.0 2022-01-01 00:52:15  0.823504
2  1.0 2022-01-01 02:00:01  0.807617
3  1.0 2022-01-01 02:37:14  0.154851
4  1.0 2022-01-01 04:44:46  0.049817
5  2.0 2022-01-01 00:03:06  0.463746
6  2.0 2022-01-01 00:08:43  0.660494
7  2.0 2022-01-01 00:15:52  0.403509