A more optimized solution to pandas apply row-wise-CodePudding

I have this code that does some analysis on a DataFrame. both_profitable is True if and only if both long_profitable and short_profitable in that row are True. However, the DataFrame is quite large and using pandas apply on axis=1 is more taxing than I'd like.

output["long_profitable"] = (
    df[[c for c in df.columns if "long_profit" in c]].ge(target).any(axis=1)
)
output["short_profitable"] = (
    df[[c for c in df.columns if "short_profit" in c]].ge(target).any(axis=1)
)
output["both_profitable"] = output.apply(
    lambda x: True if x["long_profitable"] and x["short_profitable"] else False,
    axis=1,
)

Is there a simpler/more optimized way to achieve this same goal?

CodePudding user response：

You should use eq method on the columns:

output["both_profitable"] = output["long_profitable"].eq(output["short_profitable"])

Or since both columns are boolean, you could use the bitwise & operator:

output["both_profitable"]  = output["long_profitable"] & output["short_profitable"]

Also FYI, you could use str.contains loc, instead of a list comprehension to select columns of df:

output["long_profitable"] = df.loc[:, df.columns.str.contains('long_profit')].ge(target).any(axis=1)
output["short_profitable"] = df.loc[:, df.columns.str.contains('short_profit')].ge(target).any(axis=1)

CodePudding user response：

both_profitable is True if and only if both long_profitable and short_profitable in that row are True

In other words, both_profitable is the result of boolean AND on the two columns.

This can be achieved in several ways:

output['long_profitable'] & output['short_profitable']

# for any number of boolean columns, all of which we want to AND
cols = ['long_profitable', 'short_profitable']
output[cols].all(axis=1)

# same logic, using prod() -- this is just for fun; use all() instead
output[cols].prod(axis=1).astype(bool)

Of course, you can assign any of the above to a new column:

output_modified = output.assign(both_profitable=...)

Note: the 2nd and 3rd forms are of particular interest if you are AND-ing many columns.

Timing

n = 10_000_000
np.random.seed(0)
output = pd.DataFrame({
    'long_profitable': np.random.randint(0, 2, n, dtype=bool),
    'short_profitable': np.random.randint(0, 2, n, dtype=bool),
})

%timeit output['long_profitable'] & output['short_profitable']
# 4.52 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit output[cols].all(axis=1)
# 18.6 ms ± 53 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit output[cols].prod(axis=1).astype(bool)
# 71.6 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)