I have a dataset like this one:
DateTime | Value |
---|---|
2022-01-01 11:03:45 | 0 |
2022-01-01 11:03:50 | 40 |
2022-01-01 11:03:55 | 50 |
2022-01-01 11:04:00 | 60 |
2022-01-01 11:04:05 | 5 |
2022-01-01 11:04:10 | 4 |
2022-01-01 11:04:15 | 3 |
2022-01-01 11:04:20 | 0 |
2022-01-01 11:04:25 | 0 |
2022-01-01 11:04:30 | 40 |
2022-01-01 11:04:35 | 50 |
2022-01-01 11:04:40 | 4 |
2022-01-01 11:04:45 | 3 |
2022-01-01 11:04:50 | 0 |
2022-01-02 11:03:45 | 0 |
2022-01-02 11:03:50 | 5 |
2022-01-02 11:03:55 | 50 |
2022-01-02 11:04:00 | 60 |
2022-01-02 11:04:05 | 5 |
2022-01-02 11:04:10 | 4 |
2022-01-02 11:04:15 | 3 |
2022-01-02 11:04:20 | 0 |
2022-01-02 11:04:25 | 49 |
2022-01-02 11:04:30 | 40 |
2022-01-02 11:04:35 | 50 |
2022-01-02 11:04:40 | 4 |
2022-01-02 11:04:45 | 3 |
2022-01-02 11:04:50 | 0 |
as you can see, I have some timestamps with values. It is a measurement from a device that takes a sample every 5 seconds. This is only a subset of all the data. There are some groups with low values and some with high values. I define a value as high if it is greater than 10. If consecutive rows have high values, I consider them a group. What I would like to achieve:
- count number of groups in day
- for each group calculate duration
I will show an example of my desired result below:
DateTime | Value | GroupId | Duration (in seconds) |
---|---|---|---|
2022-01-01 11:03:45 | 0 | NaN | NaN |
2022-01-01 11:03:50 | 40 | 1 | 15 |
2022-01-01 11:03:55 | 50 | 1 | 15 |
2022-01-01 11:04:00 | 60 | 1 | 15 |
2022-01-01 11:04:05 | 5 | NaN | NaN |
2022-01-01 11:04:10 | 4 | NaN | NaN |
2022-01-01 11:04:15 | 3 | NaN | NaN |
2022-01-01 11:04:20 | 0 | NaN | NaN |
2022-01-01 11:04:25 | 0 | NaN | NaN |
2022-01-01 11:04:30 | 40 | 2 | 10 |
2022-01-01 11:04:35 | 50 | 2 | 10 |
2022-01-01 11:04:40 | 4 | NaN | NaN |
2022-01-01 11:04:45 | 3 | NaN | NaN |
2022-01-01 11:04:50 | 0 | NaN | NaN |
2022-01-02 11:03:45 | 0 | NaN | NaN |
2022-01-02 11:03:50 | 5 | NaN | NaN |
2022-01-02 11:03:55 | 50 | 1 | 10 |
2022-01-02 11:04:00 | 60 | 1 | 10 |
2022-01-02 11:04:05 | 5 | NaN | NaN |
2022-01-02 11:04:10 | 4 | NaN | NaN |
2022-01-02 11:04:15 | 3 | NaN | NaN |
2022-01-02 11:04:20 | 0 | NaN | NaN |
2022-01-02 11:04:25 | 49 | 2 | 15 |
2022-01-02 11:04:30 | 40 | 2 | 15 |
2022-01-02 11:04:35 | 50 | 2 | 15 |
2022-01-02 11:04:40 | 4 | NaN | NaN |
2022-01-02 11:04:45 | 3 | NaN | NaN |
2022-01-02 11:04:50 | 0 | NaN | NaN |
I know how to read data into pandas and do basic manipulation; can you give me any hints on how to find those groups, how to measure their duration, and how to assign a number to them? Thanks!
CodePudding user response:
For `GroupId`, create groups from consecutive runs of values greater than 10 and number them with a cumulative sum (`GroupBy.cumsum`); then, per date and `GroupId`, subtract the minimal datetime from the maximal one, and finally add 5 seconds because one sample covers 5 seconds:
df['DateTime'] = pd.to_datetime(df['DateTime'])

# True on rows whose value is high (greater than 10)
s = df['Value'].gt(10)
date = df['DateTime'].dt.date

# Start a new group wherever the high/low state flips, keep only high rows,
# and number the groups per day with a cumulative sum
df['GroupId'] = s.ne(s.shift())[s].groupby(date).cumsum()

# Per day and group: (max - min) datetime + 5 s, since each sample covers 5 s
g = df.groupby([date, 'GroupId'])['DateTime']
df['Duration (in seconds)'] = (g.transform('max').sub(g.transform('min'))
                                .dt.total_seconds().add(5))
print(df)
DateTime Value GroupId Duration (in seconds)
0 2022-01-01 11:03:45 0 NaN NaN
1 2022-01-01 11:03:50 40 1.0 15.0
2 2022-01-01 11:03:55 50 1.0 15.0
3 2022-01-01 11:04:00 60 1.0 15.0
4 2022-01-01 11:04:05 5 NaN NaN
5 2022-01-01 11:04:10 4 NaN NaN
6 2022-01-01 11:04:15 3 NaN NaN
7 2022-01-01 11:04:20 0 NaN NaN
8 2022-01-01 11:04:25 0 NaN NaN
9 2022-01-01 11:04:30 40 2.0 10.0
10 2022-01-01 11:04:35 50 2.0 10.0
11 2022-01-01 11:04:40 4 NaN NaN
12 2022-01-01 11:04:45 3 NaN NaN
13 2022-01-01 11:04:50 0 NaN NaN
14 2022-01-02 11:03:45 0 NaN NaN
15 2022-01-02 11:03:50 5 NaN NaN
16 2022-01-02 11:03:55 50 1.0 10.0
17 2022-01-02 11:04:00 60 1.0 10.0
18 2022-01-02 11:04:05 5 NaN NaN
19 2022-01-02 11:04:10 4 NaN NaN
20 2022-01-02 11:04:15 3 NaN NaN
21 2022-01-02 11:04:20 0 NaN NaN
22 2022-01-02 11:04:25 49 2.0 15.0
23 2022-01-02 11:04:30 40 2.0 15.0
24 2022-01-02 11:04:35 50 2.0 15.0
25 2022-01-02 11:04:40 4 NaN NaN
26 2022-01-02 11:04:45 3 NaN NaN
27 2022-01-02 11:04:50 0 NaN NaN
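The question's first bullet (number of groups per day) follows directly from this result: `GroupId` restarts at 1 every day, so the per-day count is simply the per-day maximum (or `nunique`) of `GroupId`. A minimal sketch on a toy frame (the data here is made up for illustration):

```python
import pandas as pd

# Toy data: two days, with 2 and 1 high-value runs respectively
df = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2022-01-01 11:03:45", "2022-01-01 11:03:50", "2022-01-01 11:03:55",
        "2022-01-01 11:04:00", "2022-01-01 11:04:05",
        "2022-01-02 11:03:45", "2022-01-02 11:03:50",
    ]),
    "Value": [0, 40, 5, 60, 3, 20, 0],
})

s = df["Value"].gt(10)                      # True on high-value rows
date = df["DateTime"].dt.date
# Same trick as above: new group on each state flip, numbered per day
df["GroupId"] = s.ne(s.shift())[s].groupby(date).cumsum()

# GroupId restarts at 1 per day, so the max is the group count for that day
groups_per_day = df.groupby(date)["GroupId"].max()
print(groups_per_day)
```

`nunique()` would work equally well here and is a bit more robust if the ids ever stopped being consecutive.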
Another idea: compute `Duration` from the row just before each group, by backfilling `GroupId` one row so the group span already includes the preceding sample:
df['DateTime'] = pd.to_datetime(df['DateTime'])

s = df['Value'].gt(10)
date = df['DateTime'].dt.date
df['GroupId'] = s.ne(s.shift())[s].groupby(date).cumsum()

# Backfill GroupId one row per day, so each group also covers the sample
# immediately before its first high value
prev = df.groupby(date)['GroupId'].bfill(limit=1)

# Now (max - min) already spans the full duration, no +5 needed;
# mask the result back to NaN on low-value rows
g = df.groupby([date, prev])['DateTime']
df['Duration (in seconds)'] = (g.transform('max').sub(g.transform('min'))
                                .dt.total_seconds()
                                .where(s))
print(df)
DateTime Value GroupId Duration (in seconds)
0 2022-01-01 11:03:45 0 NaN NaN
1 2022-01-01 11:03:50 40 1.0 15.0
2 2022-01-01 11:03:55 50 1.0 15.0
3 2022-01-01 11:04:00 60 1.0 15.0
4 2022-01-01 11:04:05 5 NaN NaN
5 2022-01-01 11:04:10 4 NaN NaN
6 2022-01-01 11:04:15 3 NaN NaN
7 2022-01-01 11:04:20 0 NaN NaN
8 2022-01-01 11:04:25 0 NaN NaN
9 2022-01-01 11:04:30 40 2.0 10.0
10 2022-01-01 11:04:35 50 2.0 10.0
11 2022-01-01 11:04:40 4 NaN NaN
12 2022-01-01 11:04:45 3 NaN NaN
13 2022-01-01 11:04:50 0 NaN NaN
14 2022-01-02 11:03:45 0 NaN NaN
15 2022-01-02 11:03:50 5 NaN NaN
16 2022-01-02 11:03:55 50 1.0 10.0
17 2022-01-02 11:04:00 60 1.0 10.0
18 2022-01-02 11:04:05 5 NaN NaN
19 2022-01-02 11:04:10 4 NaN NaN
20 2022-01-02 11:04:15 3 NaN NaN
21 2022-01-02 11:04:20 0 NaN NaN
22 2022-01-02 11:04:25 49 2.0 15.0
23 2022-01-02 11:04:30 40 2.0 15.0
24 2022-01-02 11:04:35 50 2.0 15.0
25 2022-01-02 11:04:40 4 NaN NaN
26 2022-01-02 11:04:45 3 NaN NaN
27 2022-01-02 11:04:50 0 NaN NaN
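Both variants rest on the same shift-compare-cumsum idiom for labelling runs of consecutive states; stripped down to a plain Series it looks like this (values here are illustrative only):

```python
import pandas as pd

v = pd.Series([0, 40, 50, 5, 3, 40, 50, 0])
high = v.gt(10)                 # high/low state per row
# A run boundary is wherever the state differs from the previous row;
# cumulative-summing the boundaries labels every run with an integer.
run_id = high.ne(high.shift()).cumsum()
print(pd.DataFrame({"value": v, "high": high, "run_id": run_id}))
```

The answer's code additionally filters with `[s]` before the cumulative sum, so that only the high runs receive consecutive numbers (1, 2, ...) and the low runs stay `NaN`.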