Here's the code in question:
import pandas as pd

df = pd.DataFrame(
    [
        list(range(200)),
        list(range(200, 400))
    ],
    index=['col_1', 'col_2']
).transpose()
col_1_index = df.columns.get_loc('col_1')
col_2_index = df.columns.get_loc('col_2')
target_1 = 2
for i in range(2, len(df)):
    if (
        df.iloc[i - 2, col_1_index] -
        df.iloc[i - 1, col_2_index]
    ) > target_1:
        col_2_value = (
            df.iloc[i - 1, col_2_index] +
            target_1
        )
    elif (
        df.iloc[i - 1, col_2_index] -
        df.iloc[i - 2, col_1_index]
    ) > target_1:
        col_2_value = (
            df.iloc[i - 1, col_2_index] -
            target_1
        )
    else:
        col_2_value = df.iloc[i - 2, col_1_index]
    df.iloc[i, col_2_index] = col_2_value
df
'''
# expected output
     col_1  col_2
0        0    200
1        1    201
2        2    199
3        3    197
4        4    195
..     ...    ...
195    195    193
196    196    194
197    197    195
198    198    196
199    199    197
'''
My issue is that I can't use the common methods of speeding up the iteration, such as df.itertuples() or df.apply(), because I am referencing the previous row's calculated value.
The logic iterates over the DataFrame, comparing the t-2 col_1 value with the t-1 col_2 value to decide what to assign to the t col_2 value. So col_1 is static, while the time-t col_2 value is updated on each iteration.
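For instance, at row 2 the loop sees col_1[0] = 0 and col_2[1] = 201; since 201 - 0 > 2, it assigns col_2[2] = 201 - 2 = 199, and it is that 199 that row 3's comparison has to use. That is also why a one-shot vectorised expression with shift() doesn't work: it only ever sees the original col_2 values shifted by one row, never the value just computed for the previous row. A rough sketch of that failed attempt, on a fresh copy of the frame (df0 and the whole snippet are only for illustration):

import numpy as np
import pandas as pd

# fresh copy, before the loop above has overwritten col_2
df0 = pd.DataFrame(
    [list(range(200)), list(range(200, 400))],
    index=['col_1', 'col_2']
).transpose()
target_1 = 2

# every comparison reads the *original* shifted col_2, so the result
# drifts away from the loop's output after the first couple of rows
naive = np.where(
    df0['col_1'].shift(2) - df0['col_2'].shift(1) > target_1,
    df0['col_2'].shift(1) + target_1,
    np.where(
        df0['col_2'].shift(1) - df0['col_1'].shift(2) > target_1,
        df0['col_2'].shift(1) - target_1,
        df0['col_1'].shift(2),
    ),
)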
CodePudding user response:
Well, itertuples or apply are not especially fast, see this answer. Your main problem is that you access cells several times in the loop and assign a value to a cell on every iteration. You can be more efficient by just looping over the values of col_1, keeping the previously calculated value in a variable, appending each result to a list (faster than assigning to a specific cell) and then, outside of the loop, assigning the whole column at once. Here is one way:
prev_val2 = df.iloc[1, col_2_index]
l_val2 = [df.iloc[0, col_2_index], prev_val2]  # for the results
for val1 in df['col_1'].to_numpy()[:-2]:
    if (val1 - prev_val2) > target_1:
        prev_val2 += target_1
    elif (prev_val2 - val1) > target_1:
        prev_val2 -= target_1
    else:
        prev_val2 = val1
    l_val2.append(prev_val2)
df['col_2_fast'] = l_val2  # assign outside of the loop
Now, on a 200-row dataframe, the speed comparison goes from 61.5 ms with your method to 0.380 ms with this one (about 160 times faster), and the gain should increase with the size of the dataframe. And of course, after running your method and this one, df['col_2'].eq(df['col_2_fast']).all() gives True.
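If you want to reproduce the timing on your side, one way (just a sketch; df_timing, fast_col_2 and the repeat count are arbitrary illustrative choices) is to wrap the loop above in a function and run it through timeit on fresh copies:

import timeit
import pandas as pd

def fast_col_2(df, target_1):
    # list-based version from above, wrapped in a function for timing
    col_2_index = df.columns.get_loc('col_2')
    prev_val2 = df.iloc[1, col_2_index]
    l_val2 = [df.iloc[0, col_2_index], prev_val2]
    for val1 in df['col_1'].to_numpy()[:-2]:
        if (val1 - prev_val2) > target_1:
            prev_val2 += target_1
        elif (prev_val2 - val1) > target_1:
            prev_val2 -= target_1
        else:
            prev_val2 = val1
        l_val2.append(prev_val2)
    df['col_2_fast'] = l_val2
    return df

df_timing = pd.DataFrame(
    [list(range(200)), list(range(200, 400))],
    index=['col_1', 'col_2']
).transpose()

# 100 runs on fresh copies; time the original loop the same way to compare
print(timeit.timeit(lambda: fast_col_2(df_timing.copy(), 2), number=100))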
CodePudding user response:
I solved the issue by taking @nickodell's recommendation to save the previous value as a variable.
Time comparison of original code vs optimized code below:
CPU times: total: 46.9 ms
vs. CPU times: total: 15.6 ms
import pandas as pd

def calc(df):
    col_1_tmin2_val = df['col_1'][0]
    col_1_tmin1_val = df['col_1'][1]
    col_2_tmin1_val = df['col_2'][1]

    def func_apply(row: pd.Series) -> float:
        nonlocal col_1_tmin2_val
        nonlocal col_1_tmin1_val
        nonlocal col_2_tmin1_val
        if (col_1_tmin2_val - col_2_tmin1_val) > target_1:
            col_2_value = col_2_tmin1_val + target_1
        elif (col_2_tmin1_val - col_1_tmin2_val) > target_1:
            col_2_value = col_2_tmin1_val - target_1
        else:
            col_2_value = col_1_tmin2_val
        col_1_tmin2_val = col_1_tmin1_val
        col_1_tmin1_val = row['col_1']
        col_2_tmin1_val = col_2_value
        return col_2_value

    df.iloc[2:, col_2_index] = df.iloc[2:].apply(lambda x: func_apply(x), axis=1)
df = pd.DataFrame(
    [
        list(range(200)),
        list(range(200, 400))
    ],
    index=['col_1', 'col_2']
).transpose()
col_2_index = df.columns.get_loc('col_2')
target_1 = 2

calc(df)
df
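To double-check the result (a quick sketch; df_check and the short names c1, c2 are just scratch names), the original row-by-row loop can be re-run on a fresh copy and compared with the apply-based column:

# re-run the original loop on a fresh copy and compare with df above
df_check = pd.DataFrame(
    [list(range(200)), list(range(200, 400))],
    index=['col_1', 'col_2']
).transpose()
c1 = df_check.columns.get_loc('col_1')
c2 = df_check.columns.get_loc('col_2')
for i in range(2, len(df_check)):
    if (df_check.iloc[i - 2, c1] - df_check.iloc[i - 1, c2]) > target_1:
        val = df_check.iloc[i - 1, c2] + target_1
    elif (df_check.iloc[i - 1, c2] - df_check.iloc[i - 2, c1]) > target_1:
        val = df_check.iloc[i - 1, c2] - target_1
    else:
        val = df_check.iloc[i - 2, c1]
    df_check.iloc[i, c2] = val

print(df['col_2'].eq(df_check['col_2']).all())  # should print True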