I have this code that does some analysis on a DataFrame. both_profitable
is True
if and only if both long_profitable
and short_profitable
in that row are True
. However, the DataFrame is quite large and using pandas apply
on axis=1
is more taxing than I'd like.
output["long_profitable"] = (
df[[c for c in df.columns if "long_profit" in c]].ge(target).any(axis=1)
)
output["short_profitable"] = (
df[[c for c in df.columns if "short_profit" in c]].ge(target).any(axis=1)
)
output["both_profitable"] = output.apply(
lambda x: True if x["long_profitable"] and x["short_profitable"] else False,
axis=1,
)
Is there a simpler/more optimized way to achieve this same goal?
CodePudding user response:
You should use eq
method on the columns:
output["both_profitable"] = output["long_profitable"].eq(output["short_profitable"])
Or since both columns are boolean, you could use the bitwise &
operator:
output["both_profitable"] = output["long_profitable"] & output["short_profitable"]
Also FYI, you could use str.contains
loc
, instead of a list comprehension to select columns of df
:
output["long_profitable"] = df.loc[:, df.columns.str.contains('long_profit')].ge(target).any(axis=1)
output["short_profitable"] = df.loc[:, df.columns.str.contains('short_profit')].ge(target).any(axis=1)
CodePudding user response:
both_profitable
isTrue
if and only if bothlong_profitable
andshort_profitable
in that row areTrue
In other words, both_profitable
is the result of boolean AND on the two columns.
This can be achieved in several ways:
output['long_profitable'] & output['short_profitable']
# for any number of boolean columns, all of which we want to AND
cols = ['long_profitable', 'short_profitable']
output[cols].all(axis=1)
# same logic, using prod() -- this is just for fun; use all() instead
output[cols].prod(axis=1).astype(bool)
Of course, you can assign any of the above to a new column:
output_modified = output.assign(both_profitable=...)
Note: the 2nd and 3rd forms are of particular interest if you are AND-ing many columns.
Timing
n = 10_000_000
np.random.seed(0)
output = pd.DataFrame({
'long_profitable': np.random.randint(0, 2, n, dtype=bool),
'short_profitable': np.random.randint(0, 2, n, dtype=bool),
})
%timeit output['long_profitable'] & output['short_profitable']
# 4.52 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit output[cols].all(axis=1)
# 18.6 ms ± 53 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit output[cols].prod(axis=1).astype(bool)
# 71.6 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)