How to perform an operation with two columns in the same dataframe in Python Pandas?-CodePudding

I'm trying to apply the operation 'x-y/y', being x the column 'Faturamento' and y column 'Custo' from the dataframe called 'df', and store the results in a new column called 'Roi'.

My attempt to use the apply function:

df['Roi'] = df.apply(lambda x, y: x['Faturamento']-y['Custo']/y['Custo'], axis=1)

Is returning:

TypeError: () missing 1 required positional argument: 'y'

How can I do this?

CodePudding user response：

I think you mean:

df['Roi'] = df.apply(lambda x: (x['Faturamento']-x['Custo'])/x['Custo'], axis=1)

x refers to the dataframe

CodePudding user response：

You can just use the column operation with syntax like simple arithmetic. Pandas will automatically align the index for you, so that you are operating row by row for each operation.

df['Roi'] = (df['Faturamento'] - df['Custo']) / df['Custo']

df['Roi'] = df['Faturamento'] / df['Custo'] - 1

This way, you can enjoy the fast vectorized processing of Pandas which has been optimized to run fast. If you use .apply() with lambda function on axis=1, it's just a slow Python loop in underlying processing, and will be slow.

Performance Benchmark

Test 1. Small df with 4 rows

   Faturamento  Custo
0           50     20
1           10      5
2            5     15
3          100    400

%%timeit
df['Roi'] = df.apply(lambda x: (x['Faturamento']-x['Custo'])/x['Custo'], axis=1)

721 µs ± 3.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df['Roi'] = df['Faturamento'] / df['Custo'] - 1

490 µs ± 4.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Summary: .apply lambda takes 721 µs while Pandas built-in takes 490 µs: 1.47 times faster for small dataset of .

Test 2. Large df with 40000 rows

df2 = pd.concat([df] * 10000, ignore_index=True)

%%timeit
df2['Roi'] = df2.apply(lambda x: (x['Faturamento']-x['Custo'])/x['Custo'], axis=1)

639 ms ± 3.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df2['Roi'] = df2['Faturamento'] / df2['Custo'] - 1

767 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Summary: .apply lambda takes 639 ms (= 639,000 µs) while Pandas built-in takes 767 µs: 833x times faster for large dataset of .