Pandas df.columns runs very slowly for a large DataFrame when executed interactively in the console

EDIT: This question has been heavily edited because the original issue is not valid after running additional tests.

For a DataFrame with roughly 1 million rows and only 13 columns, getting the column names with any of the following:

print(df.columns)
df.columns
df.columns.values

is very slow (about 25 seconds). This happens only when I type the code above into the console. After the slow first execution completes, the next several runs are instant.

If I save the code as a script and run the script, all three operations finish almost instantly. This seems to be a problem with my IDE.
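One way to check that the column lookup itself is cheap is to time it in a plain script, outside the IDE console. The following is a minimal sketch, assuming a synthetic frame of about 1 million rows and 13 columns; the column names (`col0` to `col12`) and the `timed` helper are hypothetical stand-ins, not the original data:

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 1,000,000 rows x 13 columns.
n_rows, n_cols = 1_000_000, 13
names = [f"col{i}" for i in range(n_cols)]  # hypothetical column names
df = pd.DataFrame(np.random.randint(0, 100, size=(n_rows, n_cols)), columns=names)


def timed(func):
    """Call func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


columns, t_index = timed(lambda: df.columns)
values, t_values = timed(lambda: df.columns.values)
text, t_repr = timed(lambda: repr(df.columns))

print(f"df.columns:        {t_index:.6f} s")
print(f"df.columns.values: {t_values:.6f} s")
print(f"repr(df.columns):  {t_repr:.6f} s")
```

If these timings are tiny when run as a script but the console still stalls, the delay is more likely coming from the console environment than from pandas.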

I'm using DataSpell 2022.2.3 with Python 3.9.5 and pandas 1.4.4 on macOS Ventura.

CodePudding user response：

That's because neither of those two print statements actually prints the whole Series. Instead, each prints a representation of it:

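import numpy as np
import pandas as pd  # needed for pd.DataFrame below

# 10,000 rows of random integers in four columns named A, B, C, D
df = pd.DataFrame(np.random.randint(0, 100, size=(10000, 4)), columns=list('ABCD'))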

print("This is df.column")
print(df.A)
print("This is df.column.values")
print(df.A.values)

will output

This is df.column
0       35
1       76
        ..
9998    74
9999    79
Name: A, Length: 10000, dtype: int64
This is df.column.values
[35 76 91 ... 27 74 79]

As you can see, the second representation is much smaller and hence quicker to print. It also doesn't print auxiliary data such as the column name, length, and dtype.
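The size difference between the two printed representations can be checked directly by comparing the lengths of the strings that `print` would emit. This is only a rough sketch, assuming the default pandas and NumPy display options; the variable names `series_text` and `values_text` are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(10000, 4)), columns=list("ABCD"))

# str() gives the same text that print() would write.
series_text = str(df.A)         # truncated rows plus name, length, dtype
values_text = str(df.A.values)  # abbreviated NumPy array only

print("Series text length:", len(series_text))
print("values text length:", len(values_text))
```

Both strings are abbreviated, so neither grows with the full 10,000 values.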

If you measure the performance difference when doing some simple computations, you will find that it's not that big:

%timeit [x**2 for x in df.A]
%timeit [x**2 for x in df.A.values]

3.41 ms ± 25.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.09 ms ± 6.18 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
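`%timeit` is an IPython magic, so it is not available in a plain script. A comparable measurement can be sketched with the standard-library `timeit` module; the function names and the run count below are arbitrary choices for illustration, and the absolute numbers will differ from machine to machine:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(10000, 4)), columns=list("ABCD"))


def square_series():
    # Iterate over the pandas Series directly.
    return [x**2 for x in df.A]


def square_values():
    # Iterate over the underlying NumPy array.
    return [x**2 for x in df.A.values]


n_runs = 20  # arbitrary number of repetitions
t_series = timeit.timeit(square_series, number=n_runs) / n_runs
t_values = timeit.timeit(square_values, number=n_runs) / n_runs

print(f"Series: {t_series * 1e3:.2f} ms per loop")
print(f"values: {t_values * 1e3:.2f} ms per loop")
```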

CodePudding user response：

I think it is because df.columns.values returns only a NumPy array, which should be faster than going through a pandas.core.indexes.base.Index object, which may be a MultiIndex.
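What the two expressions return can be inspected with a few lines. This is a small sketch on a synthetic frame, not the original data; it only shows the types involved and that both objects hold the column labels, whose count does not depend on the number of rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(10000, 4)), columns=list("ABCD"))

cols = df.columns        # a pandas Index of column labels
raw = df.columns.values  # the same labels as a NumPy array

print(type(cols))
print(type(raw))
print("labels:", len(cols), "rows:", len(df))
```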