Home > Back-end >  Cumulative sum and substract over time when condition are met
Cumulative sum and substract over time when condition are met

Time:05-11

I have a dataframe with IOT sensors data. What I'm trying to get a humidity score over time when a condition is met. Let's say that my df looks like this:

Day       Flat    Humidity
31/12     1       False
01/01     1       True
02/01     1       True
03/01     1       True
04/01     1       True
05/01     1       False
06/01     1       False
07/01     1       True
08/01     1       True
09/01     1       False
10/01     1       False
11/01     1       False
12/01     1       False
13/01     1       False
14/01     1       False
15/01     1       True
01/01     2       False
02/01     2       False
03/01     2       True
04/01     2       False

What I would like to get is the humidity score which is the cumulative sum of when Humidity = True. The thing I do not manage to get is when Humidity = False, I do not want the sum to start from 0 but subtract 1 from the previous value when it's not equal to 0.

In my example, it should look like this:

Day       Flat    Humidity    Score
31/12     1       False       0
01/01     1       True        1
02/01     1       True        2
03/01     1       True        3
04/01     1       True        4
05/01     1       False       3
06/01     1       False       2
07/01     1       True        3
08/01     1       True        4
09/01     1       False       3
10/01     1       False       2
11/01     1       False       1
12/01     1       False       0
13/01     1       False       0
14/01     1       False       0
15/01     1       True        1
01/01     2       False       0
02/01     2       False       0
03/01     2       True        1
04/01     2       False       0

As you can see, on the 05/01 when the Humidity = False, I do not want the count to start from 0 but be the number of previous row - 1; in my case 4-1 = 3. Also, this needs to be computed by flat and the score cannot be a value below 0.

Any help would be greatly appreciated.

CodePudding user response:

You can use a cumsum per group, and then correct for the points that went below the minimum:

s = df['Humidity'].mul(2).sub(1).groupby(df['Flat']).cumsum()
mask = s.mask(df['Humidity']|s.gt(0))
corr = mask.groupby(df['Flat']).ffill().fillna(0, downcast='infer')
df['Score'] = s-corr

output (with intermediates):

      Day  Flat  Humidity  expected  raw_score  mask  correction  Score
0   01/01     1      True         1          1   NaN           0      1
1   02/01     1      True         2          2   NaN           0      2
2   03/01     1      True         3          3   NaN           0      3
3   04/01     1      True         4          4   NaN           0      4
4   05/01     1     False         3          3   NaN           0      3
5   06/01     1     False         2          2   NaN           0      2
6   07/01     1      True         3          3   NaN           0      3
7   08/01     1      True         4          4   NaN           0      4
8   09/01     1     False         3          3   NaN           0      3
9   10/01     1     False         2          2   NaN           0      2
10  11/01     1     False         1          1   NaN           0      1
11  12/01     1     False         0          0   0.0           0      0
12  13/01     1     False         0         -1  -1.0          -1      0
13  14/01     1     False         0         -2  -2.0          -2      0
14  15/01     1      True         1         -1   NaN          -2      1
15  01/01     2     False         0         -1  -1.0          -1      0
16  02/01     2     False         0         -2  -2.0          -2      0
17  03/01     2      True         1         -1   NaN          -2      1
18  04/01     2     False         0         -2  -2.0          -2      0

CodePudding user response:

This worked for me:

import numpy as np
df['Humidity2']=np.where(df['Humitidy']==False,np.nan,1)


df['ScoreTemp']=df.groupby('Flat')['Humidity2'].transform(lambda s:s.where(s.ffill().isnull(), s.fillna(-1))).fillna(0)

df['Score']=df.groupby('Flat')['ScoreTemp'].transform('cumsum')

df.drop(['ScoreTemp','Humidity2'],axis=1,inplace=True)

df
  • 'Humdity2' where all false from Humidity are converted to NaNs

  • 'ScoreTemp' where we fill with -1 all Humidity2 not starting with False for a given Flat, otherwise fill with 0

  • 'Score' doing the cumsum on that ScoreTemp calculated column grouped by Flat

  • Then drop the constructed columns

CodePudding user response:

From your DataFrame, first we get the first element of each group as we don't want to start with -1 if the Humidity is False :

df['first_elt'] = (df.groupby(['Flat'], as_index=False)['Humidity'].nth(0))

Then, we can change the boolean values of Humidity to 1 and -1, and 0 in the case of Humidity is False like so :

df.loc[df.Humidity == True, 'Humidity_tmp'] = 1
df.loc[df.Humidity == False, 'Humidity_tmp'] = -1
df.loc[df.first_elt == False, 'Humidity_tmp'] = 0
    Day    Flat Humidity    first_elt   Humidity_tmp
0   01/01   1   True        True        1.0
1   02/01   1   True        NaN         1.0
2   03/01   1   True        NaN         1.0
3   04/01   1   True        NaN         1.0
4   05/01   1   False       NaN         -1.0
5   06/01   1   False       NaN         -1.0
6   07/01   1   True        NaN         1.0
7   08/01   1   True        NaN         1.0
8   01/01   2   False       False       0.0
9   02/01   2   True        NaN         1.0
10  03/01   2   False       NaN         -1.0

To finish, we use a cumsum with the groupby on Flat to get the expected result :

df['Score'] = df.groupby('Flat')['Humidity_tmp'].cumsum()

Output :


    Day   Flat  Humidity    first_elt   Humidity_tmp    Score
0   01/01   1   True        True        1.0             1.0
1   02/01   1   True        NaN         1.0             2.0
2   03/01   1   True        NaN         1.0             3.0
3   04/01   1   True        NaN         1.0             4.0
4   05/01   1   False       NaN         -1.0            3.0
5   06/01   1   False       NaN         -1.0            2.0
6   07/01   1   True        NaN         1.0             3.0
7   08/01   1   True        NaN         1.0             4.0
8   01/01   2   False       False       0.0             0.0
9   02/01   2   True        NaN         1.0             1.0
10  03/01   2   False       NaN         -1.0            0.0

The full code :

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO("""
Day,Flat,Humidity
01/01,1,True
02/01,1,True
03/01,1,True
04/01,1,True
05/01,1,False
06/01,1,False
07/01,1,True
08/01,1,True
01/01,2,False
02/01,2,True
03/01,2,False
"""), sep=',')
df['first_elt'] = (df.groupby(['Flat'], as_index=False)['Humidity'].nth(0))
df.loc[df.Humidity == True, 'Humidity_tmp'] = 1
df.loc[df.Humidity == False, 'Humidity_tmp'] = -1
df.loc[df.first_elt == False, 'Humidity_tmp'] = 0
df['Score'] = df.groupby('Flat')['Humidity_tmp'].cumsum()
df
  • Related