Home > Software engineering >  how to efficiently do a recursive loop with calculations in a big dataframe using pandas or numpy
how to efficiently do a recursive loop with calculations in a big dataframe using pandas or numpy

Time:09-14

I have a 4.5 million row dataframe, that I want to do a simple calculation, that basically goes to the same KEY in the previous row and if it meets some conditions, it does the a sum with the entrance [i] and [i-1] and if not it jumps the sum.

I was able to create something that works fine for what I want, and I am getting the expected results, but it takes to much time, and it calls a SettingWithCopyWarning that can be ignore if I turn down the warnings.

So I am reaching out to see if it exist a better way to do this, that is less time consuming for the machine.

my code:

df_pivot.sort_values(['key_aux', 'Date'], ascending=[True, True], inplace=True)

df_pivot['YTD_v2'] = 0

for i in range(0, len(df_pivot)):
    df_pivot['YTD_v2'].iloc[i] = np.where((df_pivot['key_aux'].iloc[i]==df_pivot['key_aux'].iloc[i-1]) & (df_pivot['Month'].iloc[i] != 1), 
                                          df_pivot['YTD_v2'].iloc[i-1] df_pivot['Value'].iloc[i], 
                                          df_pivot['Value'].iloc[i])

<timed exec>:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
CPU times: total: 13min 30s
Wall time: 13min 37s

As you can see for my loop to work I need to sort the columns key_aux and Date so that he does what I am expecting.

I do believe it might exist a much better way to do this. I also did this with an standard if else loop instead of the np.where but was 30 sec more time consuming.

CodePudding user response:

Without seeing sample data and expected output it's hard to be certain, but I think what you're looking for is groupby and cumsum. Note I'm presuming you have a Year column in your dataframe, if not, given that you have a Date column it shouldn't be hard to generate one:

df_pivot['YTD_v2'] = df_pivot.groupby(['key_aux', 'Year'])['Value'].cumsum()
  • Related