I have a dataframe with IOT sensors data. What I'm trying to get a humidity score over time when a condition is met. Let's say that my df looks like this:
Day Flat Humidity
31/12 1 False
01/01 1 True
02/01 1 True
03/01 1 True
04/01 1 True
05/01 1 False
06/01 1 False
07/01 1 True
08/01 1 True
09/01 1 False
10/01 1 False
11/01 1 False
12/01 1 False
13/01 1 False
14/01 1 False
15/01 1 True
01/01 2 False
02/01 2 False
03/01 2 True
04/01 2 False
What I would like to get is the humidity score which is the cumulative sum of when Humidity = True. The thing I do not manage to get is when Humidity = False, I do not want the sum to start from 0 but subtract 1 from the previous value when it's not equal to 0.
In my example, it should look like this:
Day Flat Humidity Score
31/12 1 False 0
01/01 1 True 1
02/01 1 True 2
03/01 1 True 3
04/01 1 True 4
05/01 1 False 3
06/01 1 False 2
07/01 1 True 3
08/01 1 True 4
09/01 1 False 3
10/01 1 False 2
11/01 1 False 1
12/01 1 False 0
13/01 1 False 0
14/01 1 False 0
15/01 1 True 1
01/01 2 False 0
02/01 2 False 0
03/01 2 True 1
04/01 2 False 0
As you can see, on the 05/01 when the Humidity = False, I do not want the count to start from 0 but be the number of previous row - 1; in my case 4-1 = 3. Also, this needs to be computed by flat and the score cannot be a value below 0.
Any help would be greatly appreciated.
CodePudding user response:
You can use a cumsum
per group, and then correct for the points that went below the minimum:
s = df['Humidity'].mul(2).sub(1).groupby(df['Flat']).cumsum()
mask = s.mask(df['Humidity']|s.gt(0))
corr = mask.groupby(df['Flat']).ffill().fillna(0, downcast='infer')
df['Score'] = s-corr
output (with intermediates):
Day Flat Humidity expected raw_score mask correction Score
0 01/01 1 True 1 1 NaN 0 1
1 02/01 1 True 2 2 NaN 0 2
2 03/01 1 True 3 3 NaN 0 3
3 04/01 1 True 4 4 NaN 0 4
4 05/01 1 False 3 3 NaN 0 3
5 06/01 1 False 2 2 NaN 0 2
6 07/01 1 True 3 3 NaN 0 3
7 08/01 1 True 4 4 NaN 0 4
8 09/01 1 False 3 3 NaN 0 3
9 10/01 1 False 2 2 NaN 0 2
10 11/01 1 False 1 1 NaN 0 1
11 12/01 1 False 0 0 0.0 0 0
12 13/01 1 False 0 -1 -1.0 -1 0
13 14/01 1 False 0 -2 -2.0 -2 0
14 15/01 1 True 1 -1 NaN -2 1
15 01/01 2 False 0 -1 -1.0 -1 0
16 02/01 2 False 0 -2 -2.0 -2 0
17 03/01 2 True 1 -1 NaN -2 1
18 04/01 2 False 0 -2 -2.0 -2 0
CodePudding user response:
This worked for me:
import numpy as np
df['Humidity2']=np.where(df['Humitidy']==False,np.nan,1)
df['ScoreTemp']=df.groupby('Flat')['Humidity2'].transform(lambda s:s.where(s.ffill().isnull(), s.fillna(-1))).fillna(0)
df['Score']=df.groupby('Flat')['ScoreTemp'].transform('cumsum')
df.drop(['ScoreTemp','Humidity2'],axis=1,inplace=True)
df
'Humdity2' where all false from Humidity are converted to NaNs
'ScoreTemp' where we fill with -1 all Humidity2 not starting with False for a given Flat, otherwise fill with 0
'Score' doing the cumsum on that ScoreTemp calculated column grouped by Flat
Then drop the constructed columns
CodePudding user response:
From your DataFrame
, first we get the first element of each group as we don't want to start with -1
if the Humidity
is False
:
df['first_elt'] = (df.groupby(['Flat'], as_index=False)['Humidity'].nth(0))
Then, we can change the boolean values of Humidity
to 1
and -1
, and 0
in the case of Humidity
is False
like so :
df.loc[df.Humidity == True, 'Humidity_tmp'] = 1
df.loc[df.Humidity == False, 'Humidity_tmp'] = -1
df.loc[df.first_elt == False, 'Humidity_tmp'] = 0
Day Flat Humidity first_elt Humidity_tmp
0 01/01 1 True True 1.0
1 02/01 1 True NaN 1.0
2 03/01 1 True NaN 1.0
3 04/01 1 True NaN 1.0
4 05/01 1 False NaN -1.0
5 06/01 1 False NaN -1.0
6 07/01 1 True NaN 1.0
7 08/01 1 True NaN 1.0
8 01/01 2 False False 0.0
9 02/01 2 True NaN 1.0
10 03/01 2 False NaN -1.0
To finish, we use a cumsum
with the groupby
on Flat
to get the expected result :
df['Score'] = df.groupby('Flat')['Humidity_tmp'].cumsum()
Output :
Day Flat Humidity first_elt Humidity_tmp Score
0 01/01 1 True True 1.0 1.0
1 02/01 1 True NaN 1.0 2.0
2 03/01 1 True NaN 1.0 3.0
3 04/01 1 True NaN 1.0 4.0
4 05/01 1 False NaN -1.0 3.0
5 06/01 1 False NaN -1.0 2.0
6 07/01 1 True NaN 1.0 3.0
7 08/01 1 True NaN 1.0 4.0
8 01/01 2 False False 0.0 0.0
9 02/01 2 True NaN 1.0 1.0
10 03/01 2 False NaN -1.0 0.0
The full code :
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO("""
Day,Flat,Humidity
01/01,1,True
02/01,1,True
03/01,1,True
04/01,1,True
05/01,1,False
06/01,1,False
07/01,1,True
08/01,1,True
01/01,2,False
02/01,2,True
03/01,2,False
"""), sep=',')
df['first_elt'] = (df.groupby(['Flat'], as_index=False)['Humidity'].nth(0))
df.loc[df.Humidity == True, 'Humidity_tmp'] = 1
df.loc[df.Humidity == False, 'Humidity_tmp'] = -1
df.loc[df.first_elt == False, 'Humidity_tmp'] = 0
df['Score'] = df.groupby('Flat')['Humidity_tmp'].cumsum()
df