Python - Optimize iteration on Dataframe-CodePudding

I need to iterate on a Dataframe to get values of every rows, elaborate with a function and append to another dataframe the results.

Everything is working, but for sure there is ways to optimize the code to reduce the compile time to get the files at the end.

I need to get return values from the first function, which i pass the row. With these return values i call another function to append the values.

Now i iterate on the dataframe in this way:

for row in excel.itertuples():
    param1, param2, param3, param4, param5 = ElaborateExcel(row)
    df = SaveToDataFrame(df, param1, param2, param3, param4, param5)

Thanks

CodePudding user response：

Iterating over pandas dataframes is painfully slow and inefficient, and you have reached the right conclusion to want to change it. There are multiple alternatives to looping, which is not exactly the same procedure but can accomplish almost anything that you'd want to do using a for loop.

My favorite approach is writing a custom function and then using .apply() to apply the function to your dataframe, either row-wise (axis=0) or column-wise (axis=1). Here is a simple example on a pandas series:

import pandas as pd
import numpy as np

data= np.random.rand(5,1)
df = pd.DataFrame(data=data)

def multiply(x):
    return x*2

df['multiply'] = df.apply(multiply, axis=1)
print(df)

You can go further and use the same method on a wider dataframe, but you have to use lambda function in your apply method:

data= np.random.rand(5,3)
df = pd.DataFrame(data=data)
df

Output:

    0                1         2
0   0.652011    0.358764    0.993731
1   0.690814    0.390272    0.627235
2   0.282919    0.237765    0.417414
3   0.123283    0.823570    0.992152
4   0.700824    0.604650    0.147833

def multiply(x1, x2, x3):
    return x1*2   x2*2   x3*2

df['multiply'] = df.apply(lambda row: multiply(row[0], row[1], row[2]), axis=1)
df

Output:

    0               1           2        multiply
0   0.652011    0.358764    0.993731    4.009011
1   0.690814    0.390272    0.627235    3.416642
2   0.282919    0.237765    0.417414    1.876196
3   0.123283    0.823570    0.992152    3.878009
4   0.700824    0.604650    0.147833    2.906615

Regarding the second part of your question, you can use pd.merge() if you need to merge on one or multiple id columns, or pd.join() and pd.concat() otherwise.