I know pandas provides various ways to index data. From a performance perspective, is there a difference between the following two methods, i.e. is one faster, or are both the same?
# method 1
df = table.loc[table.some_col==True, :]
# method 2
df = table[table.some_col==True]
CodePudding user response:
The second is a bit faster, which makes sense to me: the first solution combines DataFrame.loc with boolean indexing, while the second uses only boolean indexing:
import numpy as np
import pandas as pd
np.random.seed(2021)
table = pd.DataFrame(np.random.rand(10**7, 5), columns=list('abcde'))
table['some_col'] = table.a > 0.6
In [130]: %timeit table.loc[table.some_col==True, :]
258 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [131]: %timeit df = table[table.some_col==True]
241 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
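To reproduce the comparison outside IPython, something like the following should work. It is only a rough sketch and not part of the measurement above: the smaller frame (10**5 rows), the repetition count of 20, and the extra `table[table.some_col]` variant are my own assumptions for illustration. It first checks that all variants return the same rows, then times them with the standard `timeit` module. Absolute numbers will vary with machine and pandas version, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(2021)
# Assumption: a smaller frame than the 10**7 rows above, so this runs quickly
table = pd.DataFrame(np.random.rand(10**5, 5), columns=list('abcde'))
table['some_col'] = table.a > 0.6

# method 1: .loc with a boolean mask and an explicit "all columns" slice
via_loc = table.loc[table.some_col == True, :]
# method 2: plain boolean indexing
via_getitem = table[table.some_col == True]
# some_col is already boolean, so the "== True" comparison can be dropped
via_mask = table[table.some_col]

# All three should select exactly the same rows
same_result = via_loc.equals(via_getitem) and via_loc.equals(via_mask)
print("same result:", same_result)

# Time each variant; 20 repetitions is an arbitrary choice
n = 20
t_loc = timeit.timeit(lambda: table.loc[table.some_col == True, :], number=n)
t_getitem = timeit.timeit(lambda: table[table.some_col == True], number=n)
t_mask = timeit.timeit(lambda: table[table.some_col], number=n)

print(f".loc[mask, :]   : {t_loc / n * 1e3:.3f} ms per call")
print(f"[mask == True]  : {t_getitem / n * 1e3:.3f} ms per call")
print(f"[mask]          : {t_mask / n * 1e3:.3f} ms per call")
```

Since the gap measured above is only a few percent, I would pick whichever form reads better in your code unless this line sits in a hot loop.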