Python Dataframe: Update Row value once a condition is met and keep updating until another condition-CodePudding

I have a dataframe and I want to add another column with certain values which are updated on rows after the BH is yes and is updated until the date is changed

The value to to entered is diff of CP and DP

Date        CP      DP      BH
2/9/2016    122.2   122.13
2/9/2016    122 122.13  
2/9/2016    122.35  122.13
2/9/2016    122.85  122.13  Yes
2/9/2016    122.75  122.13
2/9/2016    122.95  122.13  
2/9/2016    123     122.13  
2/10/2016   123.65  122.12  
2/10/2016   123.19  122.12  
2/10/2016   122.45  122.12  Yes
2/10/2016   121.85  122.12  
2/10/2016   121.75  122.12  
2/10/2016   121.35  122.12

The expected output is

Date        CP      DP      BH  Distance
2/9/2016    122.2   122.13      
2/9/2016    122 122.13      
2/9/2016    122.35  122.13      
2/9/2016    122.85  122.13  Yes 
2/9/2016    122.75  122.13      0.62
2/9/2016    122.95  122.13      0.82
2/9/2016    123     122.13      0.87
2/10/2016   123.65  122.12      
2/10/2016   123.19  122.12      
2/10/2016   122.45  122.12  Yes 
2/10/2016   121.85  122.12      -0.27
2/10/2016   121.75  122.12      -0.37
2/10/2016   121.35  122.12      -0.77

so basically distance is updated for all the rows after the BH is yes for the same date, the same action is done for each date. I have over 2000000 records in the dataframe.

I have been able to update BH as yes using np.where however not able to workaround for updating the distance column. Considering the amount of data I have, I am trying to avoid loops, any suggestions

CodePudding user response：

Compare column BH by Series.eq and then by GroupBy.cummax get all Trues after match per groups, last subtract filtered rows to new column Distance:

m = df['BH'].eq('Yes')
mask = m.groupby(df['Date']).cummax() & ~m

df['Distance'] = df.loc[mask, 'CP'].sub(df.loc[mask, 'DP'])
print (df)
         Date      CP      DP   BH  Distance
0    2/9/2016  122.20  122.13  NaN       NaN
1    2/9/2016  122.00  122.13  NaN       NaN
2    2/9/2016  122.35  122.13  NaN       NaN
3    2/9/2016  122.85  122.13  Yes       NaN
4    2/9/2016  122.75  122.13  NaN      0.62
5    2/9/2016  122.95  122.13  NaN      0.82
6    2/9/2016  123.00  122.13  NaN      0.87
7   2/10/2016  123.65  122.12  NaN       NaN
8   2/10/2016  123.19  122.12  NaN       NaN
9   2/10/2016  122.45  122.12  Yes       NaN
10  2/10/2016  121.85  122.12  NaN     -0.27
11  2/10/2016  121.75  122.12  NaN     -0.37
12  2/10/2016  121.35  122.12  NaN     -0.77

CodePudding user response：

Another possible solution:

df['Distance'] = df['CP'].sub(df['DP']).where(
    df.groupby('Date')['BH'].ffill().eq('Yes') & df['BH'].ne('Yes'))

Output:

         Date      CP      DP   BH  Distance
0    2/9/2016  122.20  122.13  NaN       NaN
1    2/9/2016  122.00  122.13  NaN       NaN
2    2/9/2016  122.35  122.13  NaN       NaN
3    2/9/2016  122.85  122.13  Yes       NaN
4    2/9/2016  122.75  122.13  NaN      0.62
5    2/9/2016  122.95  122.13  NaN      0.82
6    2/9/2016  123.00  122.13  NaN      0.87
7   2/10/2016  123.65  122.12  NaN       NaN
8   2/10/2016  123.19  122.12  NaN       NaN
9   2/10/2016  122.45  122.12  Yes       NaN
10  2/10/2016  121.85  122.12  NaN     -0.27
11  2/10/2016  121.75  122.12  NaN     -0.37
12  2/10/2016  121.35  122.12  NaN     -0.77