Here's the code in question:
import pandas as pd

df = pd.DataFrame(
    [
        list(range(200)),
        list(range(200, 400))
    ],
    index=['col_1', 'col_2']
).transpose()
col_1_index = df.columns.get_loc('col_1')
col_2_index = df.columns.get_loc('col_2')
target_1 = 2
for i in range(2, len(df)):
    if (
        df.iloc[i - 2, col_1_index] -
        df.iloc[i - 1, col_2_index]
    ) > target_1:
        col_2_value = (
            df.iloc[i - 1, col_2_index] +
            target_1
        )
    elif (
        df.iloc[i - 1, col_2_index] -
        df.iloc[i - 2, col_1_index]
    ) > target_1:
        col_2_value = (
            df.iloc[i - 1, col_2_index] -
            target_1
        )
    else:
        col_2_value = df.iloc[i - 2, col_1_index]
    df.iloc[i, col_2_index] = col_2_value
df
'''
# expected output
     col_1  col_2
0        0    200
1        1    201
2        2    199
3        3    197
4        4    195
..     ...    ...
195    195    193
196    196    194
197    197    195
198    198    196
199    199    197
'''
My issue is that I can't use the common methods of speeding up the iteration, such as df.itertuples() or df.apply(), because I am referencing the previous row's calculated value.
The logic iterates over the DataFrame, comparing the t-2 col_1 value with the t-1 col_2 value to decide what to assign to the t col_2 value. So col_1 is static, while the time-t col_2 value is updated on each iteration.
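For instance, at row 2 the loop sees col_1[0] = 0 and col_2[1] = 201; since 201 - 0 > 2, it assigns col_2[2] = 201 - 2 = 199, and it is that 199 that row 3's comparison has to use. That is also why a one-shot vectorised expression with shift() doesn't work: it only ever sees the original col_2 values shifted by one row, never the value just computed for the previous row. A rough sketch of that failed attempt, on a fresh copy of the frame (df0 and the whole snippet are only for illustration):

import numpy as np
import pandas as pd

# fresh copy, before the loop above has overwritten col_2
df0 = pd.DataFrame(
    [list(range(200)), list(range(200, 400))],
    index=['col_1', 'col_2']
).transpose()
target_1 = 2

# every comparison reads the *original* shifted col_2, so the result
# drifts away from the loop's output after the first couple of rows
naive = np.where(
    df0['col_1'].shift(2) - df0['col_2'].shift(1) > target_1,
    df0['col_2'].shift(1) + target_1,
    np.where(
        df0['col_2'].shift(1) - df0['col_1'].shift(2) > target_1,
        df0['col_2'].shift(1) - target_1,
        df0['col_1'].shift(2),
    ),
)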
CodePudding user response:
Well, itertuples or apply are not especially fast, see this answer. Your main problem is that you access cells several times in the loop and assign a value to a cell on every iteration. You can be more efficient by just looping over the values of col_1, keeping the previously calculated value in a variable, appending each result to a list (faster than assigning to a specific cell) and then, outside of the loop, assigning the whole column at once. Here is one way:
prev_val2 = df.iloc[1, col_2_index]
l_val2 = [df.iloc[0, col_2_index], prev_val2]  # for the results
for val1 in df['col_1'].to_numpy()[:-2]:
    if (val1 - prev_val2) > target_1:
        prev_val2 += target_1
    elif (prev_val2 - val1) > target_1:
        prev_val2 -= target_1
    else:
        prev_val2 = val1
    l_val2.append(prev_val2)
df['col_2_fast'] = l_val2  # assign outside of the loop
Now, on a 200-row dataframe, the speed comparison goes from 61.5 ms with your method to 0.380 ms with this one (about 160 times faster), and the gain should increase with the size of the dataframe. And of course, after running your method and this one, df['col_2'].eq(df['col_2_fast']).all() gives True.
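If you want to reproduce the timing on your side, one way (just a sketch; df_timing, fast_col_2 and the repeat count are arbitrary illustrative choices) is to wrap the loop above in a function and run it through timeit on fresh copies:

import timeit
import pandas as pd

def fast_col_2(df, target_1):
    # list-based version from above, wrapped in a function for timing
    col_2_index = df.columns.get_loc('col_2')
    prev_val2 = df.iloc[1, col_2_index]
    l_val2 = [df.iloc[0, col_2_index], prev_val2]
    for val1 in df['col_1'].to_numpy()[:-2]:
        if (val1 - prev_val2) > target_1:
            prev_val2 += target_1
        elif (prev_val2 - val1) > target_1:
            prev_val2 -= target_1
        else:
            prev_val2 = val1
        l_val2.append(prev_val2)
    df['col_2_fast'] = l_val2
    return df

df_timing = pd.DataFrame(
    [list(range(200)), list(range(200, 400))],
    index=['col_1', 'col_2']
).transpose()

# 100 runs on fresh copies; time the original loop the same way to compare
print(timeit.timeit(lambda: fast_col_2(df_timing.copy(), 2), number=100))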
CodePudding user response:
I solved the issue by taking @nickodell's recommendation to save the previous value as a variable.
Time comparison of original code vs optimized code below:
CPU times: total: 46.9 ms
vs. CPU times: total: 15.6 ms
import pandas as pd

def calc(df):
    col_1_tmin2_val = df['col_1'][0]
    col_1_tmin1_val = df['col_1'][1]
    col_2_tmin1_val = df['col_2'][1]

    def func_apply(row: pd.Series) -> float:
        nonlocal col_1_tmin2_val
        nonlocal col_1_tmin1_val
        nonlocal col_2_tmin1_val
        if (col_1_tmin2_val - col_2_tmin1_val) > target_1:
            col_2_value = col_2_tmin1_val + target_1
        elif (col_2_tmin1_val - col_1_tmin2_val) > target_1:
            col_2_value = col_2_tmin1_val - target_1
        else:
            col_2_value = col_1_tmin2_val
        col_1_tmin2_val = col_1_tmin1_val
        col_1_tmin1_val = row['col_1']
        col_2_tmin1_val = col_2_value
        return col_2_value

    df.iloc[2:, col_2_index] = df.iloc[2:].apply(lambda x: func_apply(x), axis=1)
df = pd.DataFrame(
    [
        list(range(200)),
        list(range(200, 400))
    ],
    index=['col_1', 'col_2']
).transpose()
col_2_index = df.columns.get_loc('col_2')
target_1 = 2

calc(df)
df
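To double-check the result (a quick sketch; df_check and the short names c1, c2 are just scratch names), the original row-by-row loop can be re-run on a fresh copy and compared with the apply-based column:

# re-run the original loop on a fresh copy and compare with df above
df_check = pd.DataFrame(
    [list(range(200)), list(range(200, 400))],
    index=['col_1', 'col_2']
).transpose()
c1 = df_check.columns.get_loc('col_1')
c2 = df_check.columns.get_loc('col_2')
for i in range(2, len(df_check)):
    if (df_check.iloc[i - 2, c1] - df_check.iloc[i - 1, c2]) > target_1:
        val = df_check.iloc[i - 1, c2] + target_1
    elif (df_check.iloc[i - 1, c2] - df_check.iloc[i - 2, c1]) > target_1:
        val = df_check.iloc[i - 1, c2] - target_1
    else:
        val = df_check.iloc[i - 2, c1]
    df_check.iloc[i, c2] = val

print(df['col_2'].eq(df_check['col_2']).all())  # should print True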