I have a dataframe and I want to add another column with certain values which are updated on rows after the BH is yes and is updated until the date is changed
The value to to entered is diff of CP and DP
Date CP DP BH
2/9/2016 122.2 122.13
2/9/2016 122 122.13
2/9/2016 122.35 122.13
2/9/2016 122.85 122.13 Yes
2/9/2016 122.75 122.13
2/9/2016 122.95 122.13
2/9/2016 123 122.13
2/10/2016 123.65 122.12
2/10/2016 123.19 122.12
2/10/2016 122.45 122.12 Yes
2/10/2016 121.85 122.12
2/10/2016 121.75 122.12
2/10/2016 121.35 122.12
The expected output is
Date CP DP BH Distance
2/9/2016 122.2 122.13
2/9/2016 122 122.13
2/9/2016 122.35 122.13
2/9/2016 122.85 122.13 Yes
2/9/2016 122.75 122.13 0.62
2/9/2016 122.95 122.13 0.82
2/9/2016 123 122.13 0.87
2/10/2016 123.65 122.12
2/10/2016 123.19 122.12
2/10/2016 122.45 122.12 Yes
2/10/2016 121.85 122.12 -0.27
2/10/2016 121.75 122.12 -0.37
2/10/2016 121.35 122.12 -0.77
so basically distance is updated for all the rows after the BH is yes for the same date, the same action is done for each date. I have over 2000000 records in the dataframe.
I have been able to update BH as yes using np.where however not able to workaround for updating the distance column. Considering the amount of data I have, I am trying to avoid loops, any suggestions
CodePudding user response:
Compare column BH
by Series.eq
and then by GroupBy.cummax
get all True
s after match per groups, last subtract filtered rows to new column Distance
:
m = df['BH'].eq('Yes')
mask = m.groupby(df['Date']).cummax() & ~m
df['Distance'] = df.loc[mask, 'CP'].sub(df.loc[mask, 'DP'])
print (df)
Date CP DP BH Distance
0 2/9/2016 122.20 122.13 NaN NaN
1 2/9/2016 122.00 122.13 NaN NaN
2 2/9/2016 122.35 122.13 NaN NaN
3 2/9/2016 122.85 122.13 Yes NaN
4 2/9/2016 122.75 122.13 NaN 0.62
5 2/9/2016 122.95 122.13 NaN 0.82
6 2/9/2016 123.00 122.13 NaN 0.87
7 2/10/2016 123.65 122.12 NaN NaN
8 2/10/2016 123.19 122.12 NaN NaN
9 2/10/2016 122.45 122.12 Yes NaN
10 2/10/2016 121.85 122.12 NaN -0.27
11 2/10/2016 121.75 122.12 NaN -0.37
12 2/10/2016 121.35 122.12 NaN -0.77
CodePudding user response:
Another possible solution:
df['Distance'] = df['CP'].sub(df['DP']).where(
df.groupby('Date')['BH'].ffill().eq('Yes') & df['BH'].ne('Yes'))
Output:
Date CP DP BH Distance
0 2/9/2016 122.20 122.13 NaN NaN
1 2/9/2016 122.00 122.13 NaN NaN
2 2/9/2016 122.35 122.13 NaN NaN
3 2/9/2016 122.85 122.13 Yes NaN
4 2/9/2016 122.75 122.13 NaN 0.62
5 2/9/2016 122.95 122.13 NaN 0.82
6 2/9/2016 123.00 122.13 NaN 0.87
7 2/10/2016 123.65 122.12 NaN NaN
8 2/10/2016 123.19 122.12 NaN NaN
9 2/10/2016 122.45 122.12 Yes NaN
10 2/10/2016 121.85 122.12 NaN -0.27
11 2/10/2016 121.75 122.12 NaN -0.37
12 2/10/2016 121.35 122.12 NaN -0.77