I have a dataset like this one:
DateTime | Value |
---|---|
2022-01-01 11:03:45 | 0 |
2022-01-01 11:03:50 | 40 |
2022-01-01 11:03:55 | 50 |
2022-01-01 11:04:00 | 60 |
2022-01-01 11:04:05 | 5 |
2022-01-01 11:04:10 | 4 |
2022-01-01 11:04:15 | 3 |
2022-01-01 11:04:20 | 0 |
2022-01-01 11:04:25 | 0 |
2022-01-01 11:04:30 | 40 |
2022-01-01 11:04:35 | 50 |
2022-01-01 11:04:40 | 4 |
2022-01-01 11:04:45 | 3 |
2022-01-01 11:04:50 | 0 |
2022-01-02 11:03:45 | 0 |
2022-01-02 11:03:50 | 5 |
2022-01-02 11:03:55 | 50 |
2022-01-02 11:04:00 | 60 |
2022-01-02 11:04:05 | 5 |
2022-01-02 11:04:10 | 4 |
2022-01-02 11:04:15 | 3 |
2022-01-02 11:04:20 | 0 |
2022-01-02 11:04:25 | 49 |
2022-01-02 11:04:30 | 40 |
2022-01-02 11:04:35 | 50 |
2022-01-02 11:04:40 | 4 |
2022-01-02 11:04:45 | 3 |
2022-01-02 11:04:50 | 0 |
as you can see, I have some timestamps with values. It is a measurement from a device that takes a sample every 5 seconds. This is only a subset of all the data. There are some groups with low values and some with high values. I define a value as high if it is greater than 10. If consecutive rows have high values, I consider them a group. What I would like to achieve:
- count number of groups in day
- for each group calculate duration
I will show an example of my desired result below:
DateTime | Value | GroupId | Duration (in seconds) |
---|---|---|---|
2022-01-01 11:03:45 | 0 | NaN | NaN |
2022-01-01 11:03:50 | 40 | 1 | 15 |
2022-01-01 11:03:55 | 50 | 1 | 15 |
2022-01-01 11:04:00 | 60 | 1 | 15 |
2022-01-01 11:04:05 | 5 | NaN | NaN |
2022-01-01 11:04:10 | 4 | NaN | NaN |
2022-01-01 11:04:15 | 3 | NaN | NaN |
2022-01-01 11:04:20 | 0 | NaN | NaN |
2022-01-01 11:04:25 | 0 | NaN | NaN |
2022-01-01 11:04:30 | 40 | 2 | 10 |
2022-01-01 11:04:35 | 50 | 2 | 10 |
2022-01-01 11:04:40 | 4 | NaN | NaN |
2022-01-01 11:04:45 | 3 | NaN | NaN |
2022-01-01 11:04:50 | 0 | NaN | NaN |
2022-01-02 11:03:45 | 0 | NaN | NaN |
2022-01-02 11:03:50 | 5 | NaN | NaN |
2022-01-02 11:03:55 | 50 | 1 | 10 |
2022-01-02 11:04:00 | 60 | 1 | 10 |
2022-01-02 11:04:05 | 5 | NaN | NaN |
2022-01-02 11:04:10 | 4 | NaN | NaN |
2022-01-02 11:04:15 | 3 | NaN | NaN |
2022-01-02 11:04:20 | 0 | NaN | NaN |
2022-01-02 11:04:25 | 49 | 2 | 15 |
2022-01-02 11:04:30 | 40 | 2 | 15 |
2022-01-02 11:04:35 | 50 | 2 | 15 |
2022-01-02 11:04:40 | 4 | NaN | NaN |
2022-01-02 11:04:45 | 3 | NaN | NaN |
2022-01-02 11:04:50 | 0 | NaN | NaN |
I know how to read data into pandas and do basic manipulation; can you give me any hints on how to find those groups, how to measure their duration, and how to assign a number to them? Thanks!
CodePudding user response:
For `GroupId`, create groups from consecutive runs of values greater than 10 and number them with a cumulative sum (`GroupBy.cumsum`); then, per date and `GroupId`, subtract the minimal datetime from the maximal one, and finally add 5 seconds because one sample covers 5 seconds:
df['DateTime'] = pd.to_datetime(df['DateTime'])

# True on rows whose value is high (greater than 10)
s = df['Value'].gt(10)
date = df['DateTime'].dt.date

# Start a new group wherever the high/low state flips, keep only high rows,
# and number the groups per day with a cumulative sum
df['GroupId'] = s.ne(s.shift())[s].groupby(date).cumsum()

# Per day and group: (max - min) datetime + 5 s, since each sample covers 5 s
g = df.groupby([date, 'GroupId'])['DateTime']
df['Duration (in seconds)'] = (g.transform('max').sub(g.transform('min'))
                                .dt.total_seconds().add(5))
print(df)
DateTime Value GroupId Duration (in seconds)
0 2022-01-01 11:03:45 0 NaN NaN
1 2022-01-01 11:03:50 40 1.0 15.0
2 2022-01-01 11:03:55 50 1.0 15.0
3 2022-01-01 11:04:00 60 1.0 15.0
4 2022-01-01 11:04:05 5 NaN NaN
5 2022-01-01 11:04:10 4 NaN NaN
6 2022-01-01 11:04:15 3 NaN NaN
7 2022-01-01 11:04:20 0 NaN NaN
8 2022-01-01 11:04:25 0 NaN NaN
9 2022-01-01 11:04:30 40 2.0 10.0
10 2022-01-01 11:04:35 50 2.0 10.0
11 2022-01-01 11:04:40 4 NaN NaN
12 2022-01-01 11:04:45 3 NaN NaN
13 2022-01-01 11:04:50 0 NaN NaN
14 2022-01-02 11:03:45 0 NaN NaN
15 2022-01-02 11:03:50 5 NaN NaN
16 2022-01-02 11:03:55 50 1.0 10.0
17 2022-01-02 11:04:00 60 1.0 10.0
18 2022-01-02 11:04:05 5 NaN NaN
19 2022-01-02 11:04:10 4 NaN NaN
20 2022-01-02 11:04:15 3 NaN NaN
21 2022-01-02 11:04:20 0 NaN NaN
22 2022-01-02 11:04:25 49 2.0 15.0
23 2022-01-02 11:04:30 40 2.0 15.0
24 2022-01-02 11:04:35 50 2.0 15.0
25 2022-01-02 11:04:40 4 NaN NaN
26 2022-01-02 11:04:45 3 NaN NaN
27 2022-01-02 11:04:50 0 NaN NaN
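The question's first bullet (number of groups per day) follows directly from this result: `GroupId` restarts at 1 every day, so the per-day count is simply the per-day maximum (or `nunique`) of `GroupId`. A minimal sketch on a toy frame (the data here is made up for illustration):

```python
import pandas as pd

# Toy data: two days, with 2 and 1 high-value runs respectively
df = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2022-01-01 11:03:45", "2022-01-01 11:03:50", "2022-01-01 11:03:55",
        "2022-01-01 11:04:00", "2022-01-01 11:04:05",
        "2022-01-02 11:03:45", "2022-01-02 11:03:50",
    ]),
    "Value": [0, 40, 5, 60, 3, 20, 0],
})

s = df["Value"].gt(10)                      # True on high-value rows
date = df["DateTime"].dt.date
# Same trick as above: new group on each state flip, numbered per day
df["GroupId"] = s.ne(s.shift())[s].groupby(date).cumsum()

# GroupId restarts at 1 per day, so the max is the group count for that day
groups_per_day = df.groupby(date)["GroupId"].max()
print(groups_per_day)
```

`nunique()` would work equally well here and is a bit more robust if the ids ever stopped being consecutive.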
Another idea: compute `Duration` from the row just before each group, by backfilling `GroupId` one row so the group span already includes the preceding sample:
df['DateTime'] = pd.to_datetime(df['DateTime'])

s = df['Value'].gt(10)
date = df['DateTime'].dt.date
df['GroupId'] = s.ne(s.shift())[s].groupby(date).cumsum()

# Backfill GroupId one row per day, so each group also covers the sample
# immediately before its first high value
prev = df.groupby(date)['GroupId'].bfill(limit=1)

# Now (max - min) already spans the full duration, no +5 needed;
# mask the result back to NaN on low-value rows
g = df.groupby([date, prev])['DateTime']
df['Duration (in seconds)'] = (g.transform('max').sub(g.transform('min'))
                                .dt.total_seconds()
                                .where(s))
print(df)
DateTime Value GroupId Duration (in seconds)
0 2022-01-01 11:03:45 0 NaN NaN
1 2022-01-01 11:03:50 40 1.0 15.0
2 2022-01-01 11:03:55 50 1.0 15.0
3 2022-01-01 11:04:00 60 1.0 15.0
4 2022-01-01 11:04:05 5 NaN NaN
5 2022-01-01 11:04:10 4 NaN NaN
6 2022-01-01 11:04:15 3 NaN NaN
7 2022-01-01 11:04:20 0 NaN NaN
8 2022-01-01 11:04:25 0 NaN NaN
9 2022-01-01 11:04:30 40 2.0 10.0
10 2022-01-01 11:04:35 50 2.0 10.0
11 2022-01-01 11:04:40 4 NaN NaN
12 2022-01-01 11:04:45 3 NaN NaN
13 2022-01-01 11:04:50 0 NaN NaN
14 2022-01-02 11:03:45 0 NaN NaN
15 2022-01-02 11:03:50 5 NaN NaN
16 2022-01-02 11:03:55 50 1.0 10.0
17 2022-01-02 11:04:00 60 1.0 10.0
18 2022-01-02 11:04:05 5 NaN NaN
19 2022-01-02 11:04:10 4 NaN NaN
20 2022-01-02 11:04:15 3 NaN NaN
21 2022-01-02 11:04:20 0 NaN NaN
22 2022-01-02 11:04:25 49 2.0 15.0
23 2022-01-02 11:04:30 40 2.0 15.0
24 2022-01-02 11:04:35 50 2.0 15.0
25 2022-01-02 11:04:40 4 NaN NaN
26 2022-01-02 11:04:45 3 NaN NaN
27 2022-01-02 11:04:50 0 NaN NaN
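Both variants rest on the same shift-compare-cumsum idiom for labelling runs of consecutive states; stripped down to a plain Series it looks like this (values here are illustrative only):

```python
import pandas as pd

v = pd.Series([0, 40, 50, 5, 3, 40, 50, 0])
high = v.gt(10)                 # high/low state per row
# A run boundary is wherever the state differs from the previous row;
# cumulative-summing the boundaries labels every run with an integer.
run_id = high.ne(high.shift()).cumsum()
print(pd.DataFrame({"value": v, "high": high, "run_id": run_id}))
```

The answer's code additionally filters with `[s]` before the cumulative sum, so that only the high runs receive consecutive numbers (1, 2, ...) and the low runs stay `NaN`.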