Pandas: Mean of a column between change of condition in second column

Time:02-18

Say I have the following dataframe:

import pandas as pd
import numpy as np

data = np.random.randint(1, 10, size=(10,2))

df = pd.DataFrame(data, columns=['x1', 'x2'])
df['switch'] = [1,1,0,0,1,1,0,0,1,1]
index_ = pd.date_range('2022-01-17 13:00:00', periods=10, freq='5s')
df.index = index_.rename('Time')

resulting in:

                    x1  x2  switch
Time            
2022-01-17 13:00:00 2   6   1
2022-01-17 13:00:05 9   8   1
2022-01-17 13:00:10 4   9   0
2022-01-17 13:00:15 5   6   0
2022-01-17 13:00:20 4   9   1
2022-01-17 13:00:25 6   7   1
2022-01-17 13:00:30 4   6   0
2022-01-17 13:00:35 2   3   0
2022-01-17 13:00:40 4   9   1
2022-01-17 13:00:45 5   2   1

I'm looking for a way of getting the start time, end time, and means of x1 & x2 for each block where the switch value is 1.

So here (for example):

Start: 2022-01-17 13:00:00

End: 2022-01-17 13:00:05

x1 mean: 5.5

x2 mean: 7

I don't know how to detect the changes in the switch column, compute the means of the values between those changes, and return the times at which the changes occurred.

CodePudding user response:

You could use groupby with agg and a custom grouping key:

df2 = df.reset_index()
df2['Time'] = pd.to_datetime(df2['Time'])

(df2[df2['switch'].eq(1)]               # keep only rows with switch 1
 .groupby(df2['switch'].ne(1).cumsum()) # group by consecutive 1s
 .agg({'x1': 'mean', 'x2': 'mean', 'Time': ['min', 'max']})
)

output (the values differ from the question's table, presumably because the random data was regenerated):

         x1   x2                Time                    
       mean mean                 min                 max
switch                                                  
0       7.5  4.0 2022-01-17 13:00:00 2022-01-17 13:00:05
2       3.0  4.5 2022-01-17 13:00:20 2022-01-17 13:00:25
4       6.0  3.5 2022-01-17 13:00:40 2022-01-17 13:00:45
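
To see why the custom group works, it can help to look at the grouping key on its own. The following is a minimal sketch using the switch values from the question; the names switch, key and block_keys are only for illustration. ne(1) is True on every row where switch is 0, so the cumulative sum increases only on those rows and stays constant across each run of 1s.

```python
import pandas as pd

switch = pd.Series([1, 1, 0, 0, 1, 1, 0, 0, 1, 1], name='switch')

# True on rows where switch is not 1; cumsum bumps the counter on each such row
key = switch.ne(1).cumsum()

# keep only the switch == 1 rows: each consecutive run shares one key value
block_keys = key[switch.eq(1)].tolist()
print(block_keys)  # [0, 0, 2, 2, 4, 4]
```

These key values (0, 2, 4) are the labels that appear in the switch index of the output above.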

alternative, using named aggregation to get flat column names:

df2 = df.reset_index()
df2['Time'] = pd.to_datetime(df2['Time'])

(df2[df2['switch'].eq(1)]
 .groupby(df2['switch'].ne(1).cumsum())
 .agg(avg_x1=('x1', 'mean'),
      avg_x2=('x2', 'mean'),
      start=('Time', 'min'),
      end=('Time', 'max'))
 .reset_index(drop=True)
)

output:

   avg_x1  avg_x2               start                 end
0     7.5     4.0 2022-01-17 13:00:00 2022-01-17 13:00:05
1     3.0     4.5 2022-01-17 13:00:20 2022-01-17 13:00:25
2     6.0     3.5 2022-01-17 13:00:40 2022-01-17 13:00:45
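
Because the question builds its data with np.random.randint, the numbers above cannot be reproduced exactly. As a rough check, here is a sketch that applies the same approach to the values copied from the table in the question. The helper name summarize_blocks is hypothetical and not part of pandas, and the sketch assumes the index is already a DatetimeIndex named Time.

```python
import pandas as pd

# values copied from the table in the question, so the result is reproducible
df = pd.DataFrame(
    {
        'x1': [2, 9, 4, 5, 4, 6, 4, 2, 4, 5],
        'x2': [6, 8, 9, 6, 9, 7, 6, 3, 9, 2],
        'switch': [1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
    },
    index=pd.date_range('2022-01-17 13:00:00', periods=10, freq='5s', name='Time'),
)


def summarize_blocks(frame):
    """Hypothetical helper: one row per consecutive run of switch == 1."""
    flat = frame.reset_index()             # turn the Time index into a column
    on = flat['switch'].eq(1)              # rows that belong to a block
    key = flat['switch'].ne(1).cumsum()    # constant within each run of 1s
    return (flat[on]
            .groupby(key[on])
            .agg(avg_x1=('x1', 'mean'),
                 avg_x2=('x2', 'mean'),
                 start=('Time', 'min'),
                 end=('Time', 'max'))
            .reset_index(drop=True))


result = summarize_blocks(df)
print(result)
```

For the first block this gives an x1 mean of 5.5 and an x2 mean of 7.0 between 13:00:00 and 13:00:05, which matches the example in the question.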