When indexing a DataFrame with a boolean mask, is it faster to apply the masks sequentially?-CodePudding

Given a large DataFrame df, which is faster in general?

# combining the masks first
sub_df = df[(df["column1"] < 5) & (df["column2"] > 10)]

# applying the masks sequentially
sub_df = df[df["column1"] < 5]
sub_df = sub_df[sub_df["column2"] > 10]

The first approach only selects from the DataFrame once which may be faster, however, the second selection in the second example only has to consider a smaller DataFrame.

CodePudding user response：

It depends on your dataset.

First let's generate a DataFrame with almost all values that should be dropped in the first condition:

n = 1_000_000
p = 0.0001
np.random.seed(0)
df = pd.DataFrame({'column1': np.random.choice([0,6], size=n, p=[p, 1-p]),
                   'column2': np.random.choice([0,20], size=n)})

And as expected:

# simultaneous conditions
5.69 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# successive slicing
2.99 ms ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It is faster to first generate a small intermediate.

Now, let's change the probability to p = 0.9999. This means that the first condition will remove very few rows.

We could expect both solutions to run with a similar speed, but:

# simultaneous conditions
27.5 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# successive slicing
55.7 ms ± 3.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Now the overhead of creating the intermediate DataFrame is not negligible.