Given a large DataFrame df
, which is faster in general?
# combining the masks first
sub_df = df[(df["column1"] < 5) & (df["column2"] > 10)]
# applying the masks sequentially
sub_df = df[df["column1"] < 5]
sub_df = sub_df[sub_df["column2"] > 10]
The first approach only selects from the DataFrame once which may be faster, however, the second selection in the second example only has to consider a smaller DataFrame.
CodePudding user response:
It depends on your dataset.
First let's generate a DataFrame with almost all values that should be dropped in the first condition:
n = 1_000_000
p = 0.0001
np.random.seed(0)
df = pd.DataFrame({'column1': np.random.choice([0,6], size=n, p=[p, 1-p]),
'column2': np.random.choice([0,20], size=n)})
And as expected:
# simultaneous conditions
5.69 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# successive slicing
2.99 ms ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It is faster to first generate a small intermediate.
Now, let's change the probability to p = 0.9999
. This means that the first condition will remove very few rows.
We could expect both solutions to run with a similar speed, but:
# simultaneous conditions
27.5 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# successive slicing
55.7 ms ± 3.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now the overhead of creating the intermediate DataFrame is not negligible.