I know both yield the same result, but which is more efficient, and why?
import pandas as pd
import numpy as np
# DataFrame with shape (20, 20)
df = pd.DataFrame(np.arange(400).reshape(20, 20))
# Slice the first 10 columns
df.T[:10].T
# or
df.iloc[:, :10]
It's likely that the difference is negligible and that iloc is best practice because it is more readable, but I'd like to know the pros and cons of each.
CodePudding user response:
.iloc[] is designed for exactly this kind of positional selection and does it in a single step. Performing two transposes creates intermediate DataFrames and can involve a lot of data movement, so it is bound to be slower.
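Beyond speed, the two only agree when every column has the same dtype. Here is a minimal sketch of that, assuming a made-up frame with one integer and one string column (the column names "a" and "b" are purely illustrative):

```python
import pandas as pd

# Hypothetical example frame: one integer column, one string column.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Select the first column both ways.
via_iloc = df.iloc[:, :1]
via_transpose = df.T[:1].T

# iloc keeps the original integer dtype of column "a".
print(via_iloc.dtypes)
# The round trip through .T goes via an object-dtype frame,
# so column "a" comes back as object.
print(via_transpose.dtypes)
```

The values are the same in both cases, but the transpose version loses the column's dtype, which can slow down or break later numeric operations.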
The performance difference between the two is measurable, and gets more significant as the size of the dataframe increases. Using timeit.timeit() to measure timings:
For a small array:
>>> from timeit import timeit
>>> df = pd.DataFrame(np.arange(400).reshape(20,20))
>>> timeit("x = df.T[:10].T", globals=globals(), number=100)
0.04253590002190322
>>> timeit("x = df.iloc[:,:10]", globals=globals(), number=100)
0.006828900019172579
For a large array, the difference is more noticeable:
>>> df = pd.DataFrame(np.arange(400000000).reshape(20000,20000))
>>> timeit("x = df.T[:10].T", globals=globals(), number=100)
0.5803892000112683
>>> timeit("x = df.iloc[:,:10]", globals=globals(), number=100)
0.00561390002258122
That's about 100x slower for the transpose approach on the large array, compared with roughly 6x on the small one.
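If you want to reproduce this on your own machine, something like the following should work. It is only a sketch: compare() is a helper name I made up, and the second shape is smaller than the one above so it fits comfortably in memory. Absolute numbers will vary with hardware and pandas version.

```python
from timeit import timeit

import numpy as np
import pandas as pd


def compare(n_rows, n_cols, keep=10, number=100):
    """Time both ways of selecting the first `keep` columns (hypothetical helper)."""
    df = pd.DataFrame(np.arange(n_rows * n_cols).reshape(n_rows, n_cols))
    # Sanity check: for a single-dtype frame both approaches select the same data.
    assert df.T[:keep].T.equals(df.iloc[:, :keep])
    t_transpose = timeit(lambda: df.T[:keep].T, number=number)
    t_iloc = timeit(lambda: df.iloc[:, :keep], number=number)
    return t_transpose, t_iloc


# Shapes chosen for illustration only.
for shape in [(20, 20), (2000, 2000)]:
    t_transpose, t_iloc = compare(*shape)
    print(shape, f"transpose: {t_transpose:.4f}s", f"iloc: {t_iloc:.4f}s")
```

Passing a lambda to timeit avoids the globals=globals() argument, at the cost of a small function-call overhead that is the same for both variants.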