I'm trying to apply the operation (x - y) / y, where x is the column 'Faturamento' and y is the column 'Custo' of the dataframe 'df', and store the result in a new column called 'Roi'.
My attempt to use the apply function:
df['Roi'] = df.apply(lambda x, y: x['Faturamento']-y['Custo']/y['Custo'], axis=1)
is returning:
TypeError: <lambda>() missing 1 required positional argument: 'y'
How can I do this?
CodePudding user response:
I think you mean:
df['Roi'] = df.apply(lambda x: (x['Faturamento']-x['Custo'])/x['Custo'], axis=1)
With axis=1, x refers to each row of the dataframe, passed as a single Series, so both columns are accessed through that one argument.
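A quick sketch (with a made-up two-row frame) showing what apply actually hands the function, and hence why a two-argument lambda fails:
import pandas as pd

df = pd.DataFrame({'Faturamento': [50, 10], 'Custo': [20, 5]})

# With axis=1, apply calls the function once per row and passes the
# whole row as one Series -- there is no second argument to fill.
df.apply(lambda row: print(type(row)), axis=1)
# prints <class 'pandas.core.series.Series'> once per row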
CodePudding user response:
You can just use column operations with plain arithmetic syntax. Pandas automatically aligns the indexes for you, so the operation is applied row by row.
df['Roi'] = (df['Faturamento'] - df['Custo']) / df['Custo']
or
df['Roi'] = df['Faturamento'] / df['Custo'] - 1
The two forms are equivalent, since (f - c) / c = f / c - 1. This way you get Pandas' vectorized processing, which is heavily optimized. If you use .apply() with a lambda function on axis=1, it is just a plain Python loop under the hood and will be much slower.
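A small sketch (with made-up Series) of the index alignment mentioned above: arithmetic matches rows by index label, not by position.
import pandas as pd

a = pd.Series([1, 2, 3], index=[0, 1, 2])
b = pd.Series([30, 10, 20], index=[2, 0, 1])

# Alignment is by label: 0 -> 1 + 10, 1 -> 2 + 20, 2 -> 3 + 30
print(a + b)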
Performance Benchmark
Test 1. Small df with 4 rows
Faturamento Custo
0 50 20
1 10 5
2 5 15
3 100 400
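For reference, a frame matching the table above can be built like this:
import pandas as pd

df = pd.DataFrame({'Faturamento': [50, 10, 5, 100],
                   'Custo': [20, 5, 15, 400]})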
%%timeit
df['Roi'] = df.apply(lambda x: (x['Faturamento']-x['Custo'])/x['Custo'], axis=1)
721 µs ± 3.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df['Roi'] = df['Faturamento'] / df['Custo'] - 1
490 µs ± 4.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Summary: the .apply lambda takes 721 µs while the Pandas built-in arithmetic takes 490 µs: about 1.47 times faster for the small dataset of 4 rows.
Test 2. Large df with 40000 rows
df2 = pd.concat([df] * 10000, ignore_index=True)
%%timeit
df2['Roi'] = df2.apply(lambda x: (x['Faturamento']-x['Custo'])/x['Custo'], axis=1)
639 ms ± 3.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df2['Roi'] = df2['Faturamento'] / df2['Custo'] - 1
767 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Summary: the .apply lambda takes 639 ms (= 639,000 µs) while the Pandas built-in arithmetic takes 767 µs: roughly 833 times faster for the large dataset of 40000 rows.
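To reproduce the comparison outside of Jupyter's %%timeit magic, here is a minimal sketch using the standard-library timeit module (absolute timings will vary by machine):
import timeit
import pandas as pd

df = pd.DataFrame({'Faturamento': [50, 10, 5, 100],
                   'Custo': [20, 5, 15, 400]})
df2 = pd.concat([df] * 10000, ignore_index=True)

# Time each approach over 10 runs and report the per-run average.
apply_time = timeit.timeit(
    lambda: df2.apply(lambda x: (x['Faturamento'] - x['Custo']) / x['Custo'], axis=1),
    number=10)
vector_time = timeit.timeit(
    lambda: df2['Faturamento'] / df2['Custo'] - 1,
    number=10)

print(f'apply:      {apply_time / 10:.6f} s per loop')
print(f'vectorized: {vector_time / 10:.6f} s per loop')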