I have a 4.5 million row dataframe on which I want to do a simple calculation: for each row, look at the same KEY
in the previous row and, if it meets some conditions, sum the entries at [i]
and [i-1]
; if not, it skips the sum.
I was able to create something that works fine for what I want, and I am getting the expected results, but it takes too much time, and it raises a SettingWithCopyWarning
that can be ignored if I turn off the warnings.
So I am reaching out to see if there is a better way to do this that is less time-consuming for the machine.
my code:
df_pivot.sort_values(['key_aux', 'Date'], ascending=[True, True], inplace=True)
df_pivot['YTD_v2'] = 0
for i in range(0, len(df_pivot)):
    df_pivot['YTD_v2'].iloc[i] = np.where(
        (df_pivot['key_aux'].iloc[i] == df_pivot['key_aux'].iloc[i-1]) & (df_pivot['Month'].iloc[i] != 1),
        df_pivot['YTD_v2'].iloc[i-1] + df_pivot['Value'].iloc[i],
        df_pivot['Value'].iloc[i])
<timed exec>:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
CPU times: total: 13min 30s
Wall time: 13min 37s
As you can see, for my loop to work I need to sort by the key_aux
and Date
columns so that it does what I am expecting.
I do believe there must be a much better way to do this. I also tried this with a standard if/else
inside the loop instead of the np.where
, but it was about 30 seconds slower.
CodePudding user response:
Without seeing sample data and expected output it's hard to be certain, but I think what you're looking for is groupby
and cumsum
. Note I'm presuming you have a Year
column in your dataframe; if not, given that you have a Date
column it shouldn't be hard to generate one:
df_pivot['YTD_v2'] = df_pivot.groupby(['key_aux', 'Year'])['Value'].cumsum()
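To illustrate, here is a minimal sketch with made-up sample data (the column names match your post; the Year and Month derivation assumes Date is a datetime column). The cumulative sum restarts whenever either the key or the year changes, which reproduces the loop's "same key and not January" reset logic when the data is one row per month:

```python
import pandas as pd

# Hypothetical sample data: two keys, with dates crossing a year boundary.
df_pivot = pd.DataFrame({
    'key_aux': ['A', 'A', 'A', 'B', 'B'],
    'Date': pd.to_datetime(['2022-11-01', '2022-12-01', '2023-01-01',
                            '2022-12-01', '2023-01-01']),
    'Value': [10, 20, 30, 5, 7],
})
df_pivot['Year'] = df_pivot['Date'].dt.year
df_pivot['Month'] = df_pivot['Date'].dt.month

df_pivot.sort_values(['key_aux', 'Date'], inplace=True)
# Running total per (key, year): resets at each new key or new year.
df_pivot['YTD_v2'] = df_pivot.groupby(['key_aux', 'Year'])['Value'].cumsum()
print(df_pivot['YTD_v2'].tolist())  # [10, 30, 30, 5, 7]
```

Because this is a single vectorized pass instead of a Python-level loop with chained .iloc assignments, it avoids the SettingWithCopyWarning entirely and should run in seconds rather than minutes on 4.5 million rows.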