I have a 4.5 million row dataframe on which I want to do a simple calculation: for each row, look at the same KEY
in the previous row and, if it meets some conditions, sum the entries at [i]
and [i-1]
; if not, it skips the sum.
I was able to create something that works fine for what I want, and I am getting the expected results, but it takes too much time, and it raises a SettingWithCopyWarning
that can be ignored if I turn off the warnings.
So I am reaching out to see if there is a better way to do this that is less time-consuming for the machine.
my code:
df_pivot.sort_values(['key_aux', 'Date'], ascending=[True, True], inplace=True)
df_pivot['YTD_v2'] = 0
for i in range(0, len(df_pivot)):
    df_pivot['YTD_v2'].iloc[i] = np.where(
        (df_pivot['key_aux'].iloc[i] == df_pivot['key_aux'].iloc[i-1]) & (df_pivot['Month'].iloc[i] != 1),
        df_pivot['YTD_v2'].iloc[i-1] + df_pivot['Value'].iloc[i],
        df_pivot['Value'].iloc[i])
<timed exec>:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
CPU times: total: 13min 30s
Wall time: 13min 37s
As you can see, for my loop to work I need to sort by the key_aux
and Date
columns so that it does what I am expecting.
I do believe there must be a much better way to do this. I also tried this with a standard if/else
inside the loop instead of the np.where
, but it was about 30 seconds slower.
CodePudding user response:
Without seeing sample data and expected output it's hard to be certain, but I think what you're looking for is groupby
and cumsum
. Note I'm presuming you have a Year
column in your dataframe; if not, given that you have a Date
column it shouldn't be hard to generate one:
df_pivot['YTD_v2'] = df_pivot.groupby(['key_aux', 'Year'])['Value'].cumsum()
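To illustrate, here is a minimal sketch with made-up sample data (the column names match your post; the Year and Month derivation assumes Date is a datetime column). The cumulative sum restarts whenever either the key or the year changes, which reproduces the loop's "same key and not January" reset logic when the data is one row per month:

```python
import pandas as pd

# Hypothetical sample data: two keys, with dates crossing a year boundary.
df_pivot = pd.DataFrame({
    'key_aux': ['A', 'A', 'A', 'B', 'B'],
    'Date': pd.to_datetime(['2022-11-01', '2022-12-01', '2023-01-01',
                            '2022-12-01', '2023-01-01']),
    'Value': [10, 20, 30, 5, 7],
})
df_pivot['Year'] = df_pivot['Date'].dt.year
df_pivot['Month'] = df_pivot['Date'].dt.month

df_pivot.sort_values(['key_aux', 'Date'], inplace=True)
# Running total per (key, year): resets at each new key or new year.
df_pivot['YTD_v2'] = df_pivot.groupby(['key_aux', 'Year'])['Value'].cumsum()
print(df_pivot['YTD_v2'].tolist())  # [10, 30, 30, 5, 7]
```

Because this is a single vectorized pass instead of a Python-level loop with chained .iloc assignments, it avoids the SettingWithCopyWarning entirely and should run in seconds rather than minutes on 4.5 million rows.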