Cumulative sum and substract over time when condition are met-CodePudding

I have a dataframe with IOT sensors data. What I'm trying to get a humidity score over time when a condition is met. Let's say that my df looks like this:

Day       Flat    Humidity
31/12     1       False
01/01     1       True
02/01     1       True
03/01     1       True
04/01     1       True
05/01     1       False
06/01     1       False
07/01     1       True
08/01     1       True
09/01     1       False
10/01     1       False
11/01     1       False
12/01     1       False
13/01     1       False
14/01     1       False
15/01     1       True
01/01     2       False
02/01     2       False
03/01     2       True
04/01     2       False

What I would like to get is the humidity score which is the cumulative sum of when Humidity = True. The thing I do not manage to get is when Humidity = False, I do not want the sum to start from 0 but subtract 1 from the previous value when it's not equal to 0.

In my example, it should look like this:

Day       Flat    Humidity    Score
31/12     1       False       0
01/01     1       True        1
02/01     1       True        2
03/01     1       True        3
04/01     1       True        4
05/01     1       False       3
06/01     1       False       2
07/01     1       True        3
08/01     1       True        4
09/01     1       False       3
10/01     1       False       2
11/01     1       False       1
12/01     1       False       0
13/01     1       False       0
14/01     1       False       0
15/01     1       True        1
01/01     2       False       0
02/01     2       False       0
03/01     2       True        1
04/01     2       False       0

As you can see, on the 05/01 when the Humidity = False, I do not want the count to start from 0 but be the number of previous row - 1; in my case 4-1 = 3. Also, this needs to be computed by flat and the score cannot be a value below 0.

Any help would be greatly appreciated.

CodePudding user response：

You can use a cumsum per group, and then correct for the points that went below the minimum:

s = df['Humidity'].mul(2).sub(1).groupby(df['Flat']).cumsum()
mask = s.mask(df['Humidity']|s.gt(0))
corr = mask.groupby(df['Flat']).ffill().fillna(0, downcast='infer')
df['Score'] = s-corr

output (with intermediates):

      Day  Flat  Humidity  expected  raw_score  mask  correction  Score
0   01/01     1      True         1          1   NaN           0      1
1   02/01     1      True         2          2   NaN           0      2
2   03/01     1      True         3          3   NaN           0      3
3   04/01     1      True         4          4   NaN           0      4
4   05/01     1     False         3          3   NaN           0      3
5   06/01     1     False         2          2   NaN           0      2
6   07/01     1      True         3          3   NaN           0      3
7   08/01     1      True         4          4   NaN           0      4
8   09/01     1     False         3          3   NaN           0      3
9   10/01     1     False         2          2   NaN           0      2
10  11/01     1     False         1          1   NaN           0      1
11  12/01     1     False         0          0   0.0           0      0
12  13/01     1     False         0         -1  -1.0          -1      0
13  14/01     1     False         0         -2  -2.0          -2      0
14  15/01     1      True         1         -1   NaN          -2      1
15  01/01     2     False         0         -1  -1.0          -1      0
16  02/01     2     False         0         -2  -2.0          -2      0
17  03/01     2      True         1         -1   NaN          -2      1
18  04/01     2     False         0         -2  -2.0          -2      0

CodePudding user response：

This worked for me:

import numpy as np
df['Humidity2']=np.where(df['Humitidy']==False,np.nan,1)


df['ScoreTemp']=df.groupby('Flat')['Humidity2'].transform(lambda s:s.where(s.ffill().isnull(), s.fillna(-1))).fillna(0)

df['Score']=df.groupby('Flat')['ScoreTemp'].transform('cumsum')

df.drop(['ScoreTemp','Humidity2'],axis=1,inplace=True)

df

'Humdity2' where all false from Humidity are converted to NaNs
'ScoreTemp' where we fill with -1 all Humidity2 not starting with False for a given Flat, otherwise fill with 0
'Score' doing the cumsum on that ScoreTemp calculated column grouped by Flat
Then drop the constructed columns

CodePudding user response：

From your DataFrame, first we get the first element of each group as we don't want to start with -1 if the Humidity is False :

df['first_elt'] = (df.groupby(['Flat'], as_index=False)['Humidity'].nth(0))

Then, we can change the boolean values of Humidity to 1 and -1, and 0 in the case of Humidity is False like so :

df.loc[df.Humidity == True, 'Humidity_tmp'] = 1
df.loc[df.Humidity == False, 'Humidity_tmp'] = -1
df.loc[df.first_elt == False, 'Humidity_tmp'] = 0

    Day    Flat Humidity    first_elt   Humidity_tmp
0   01/01   1   True        True        1.0
1   02/01   1   True        NaN         1.0
2   03/01   1   True        NaN         1.0
3   04/01   1   True        NaN         1.0
4   05/01   1   False       NaN         -1.0
5   06/01   1   False       NaN         -1.0
6   07/01   1   True        NaN         1.0
7   08/01   1   True        NaN         1.0
8   01/01   2   False       False       0.0
9   02/01   2   True        NaN         1.0
10  03/01   2   False       NaN         -1.0

To finish, we use a cumsum with the groupby on Flat to get the expected result :

df['Score'] = df.groupby('Flat')['Humidity_tmp'].cumsum()

Output :


    Day   Flat  Humidity    first_elt   Humidity_tmp    Score
0   01/01   1   True        True        1.0             1.0
1   02/01   1   True        NaN         1.0             2.0
2   03/01   1   True        NaN         1.0             3.0
3   04/01   1   True        NaN         1.0             4.0
4   05/01   1   False       NaN         -1.0            3.0
5   06/01   1   False       NaN         -1.0            2.0
6   07/01   1   True        NaN         1.0             3.0
7   08/01   1   True        NaN         1.0             4.0
8   01/01   2   False       False       0.0             0.0
9   02/01   2   True        NaN         1.0             1.0
10  03/01   2   False       NaN         -1.0            0.0

The full code :

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO("""
Day,Flat,Humidity
01/01,1,True
02/01,1,True
03/01,1,True
04/01,1,True
05/01,1,False
06/01,1,False
07/01,1,True
08/01,1,True
01/01,2,False
02/01,2,True
03/01,2,False
"""), sep=',')
df['first_elt'] = (df.groupby(['Flat'], as_index=False)['Humidity'].nth(0))
df.loc[df.Humidity == True, 'Humidity_tmp'] = 1
df.loc[df.Humidity == False, 'Humidity_tmp'] = -1
df.loc[df.first_elt == False, 'Humidity_tmp'] = 0
df['Score'] = df.groupby('Flat')['Humidity_tmp'].cumsum()
df