EDIT: This question has been heavily edited because the original issue is not valid after running additional tests.
For a DataFrame with roughly 1 million rows and only 13 columns, getting the column names with any one of the following:
print(df.columns)
df.columns
df.columns.values
is very slow (about 25 sec). This happens ONLY WHEN I TYPE THE ABOVE CODE INTO THE CONSOLE. Once the slow execution has completed, the next several runs are instant.
If I save the code as a script and run the script, all three operations finish in no time. So it seems to be a problem with my IDE.
I'm using DataSpell 2022.2.3 with Python 3.9.5 and pandas 1.4.4 on macOS Ventura.
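For reference, the script-side check can be sketched roughly as follows. The DataFrame is a synthetic stand-in (random integers and made-up column names col0 to col12), not the original data, and time_call is a helper introduced only for this illustration.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: ~1 million rows, 13 columns.
N_ROWS = 1_000_000
df = pd.DataFrame(
    np.random.randint(0, 100, size=(N_ROWS, 13)),
    columns=[f"col{i}" for i in range(13)],
)


def time_call(label, func):
    """Run func once, print the wall-clock time, and return (result, seconds)."""
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.6f} s")
    return result, elapsed


# Accessing the labels and building their text representation,
# which is what a console has to do before it can display them.
cols, t_cols = time_call("df.columns", lambda: df.columns)
vals, t_vals = time_call("df.columns.values", lambda: df.columns.values)
text, t_repr = time_call("repr(df.columns)", lambda: repr(df.columns))
```

If all three timings are tiny when run as a plain script, the delay seen in the console is more likely to come from the IDE than from pandas itself.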
CodePudding user response:
That's because neither of those two print statements actually prints the whole Series. Instead, each prints a representation of it:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(10000, 4)), columns=list('ABCD'))
print("This is df.column")
print(df.A)
print("This is df.column.values")
print(df.A.values)
will output
This is df.column
0 35
1 76
..
9998 74
9999 79
Name: A, Length: 10000, dtype: int64
This is df.column.values
[35 76 91 ... 27 74 79]
As you can see, the second representation is much smaller and therefore quicker to print. It also doesn't print auxiliary data such as the column name, length, and dtype.
If you measure the performance difference on some simple computations, you will find that it's not that big:
%timeit [x**2 for x in df.A]
%timeit [x**2 for x in df.A.values]
3.41 ms ± 25.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.09 ms ± 6.18 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
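%timeit is an IPython magic, so the two lines above only work in IPython or a notebook. A rough equivalent for a plain Python script, using the standard timeit module, might look like this. The helper names square_series and square_values and the run count are choices made for this sketch, and the timings will differ from the numbers above.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(10000, 4)), columns=list("ABCD"))


def square_series():
    # Iterate over the pandas Series directly.
    return [x**2 for x in df.A]


def square_values():
    # Iterate over the underlying NumPy array.
    return [x**2 for x in df.A.values]


n_runs = 20  # arbitrary number of repetitions for this sketch
t_series = timeit.timeit(square_series, number=n_runs) / n_runs
t_values = timeit.timeit(square_values, number=n_runs) / n_runs

print(f"df.A:        {t_series * 1e3:.2f} ms per loop")
print(f"df.A.values: {t_values * 1e3:.2f} ms per loop")
```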
CodePudding user response:
I think it is because df.columns.values returns only a NumPy array, which should be faster than going through the pandas.core.indexes.base.Index object, which may be a MultiIndex.
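The difference between the two objects can be inspected directly. This is a small sketch on a made-up five-row DataFrame. In both cases the object holds only the column labels, so its size depends on the number of columns, not the number of rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(5, 4)), columns=list("ABCD"))

cols = df.columns         # a pandas Index holding the column labels
vals = df.columns.values  # the same labels as a plain NumPy array

print(type(cols))
print(type(vals))
print(repr(cols))
print(repr(vals))
```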