I have the following line of code
end_df['Soma Internet'] = end_df.iloc[:,end_df.columns.get_level_values(1) == 'Internet'].drop('site',axis=1).sum(axis=1)
It basically, filts my multi index df by a specific level 1 column. Drops a few not wanted columns. And does the sum, of all the other ones.
I took a glance, at a few of the documentation and other asked questions. But i didnt quite understood what causes the warning, and i also would love to rewrite this code, so i get rid of it.
CodePudding user response:
Let's try with an example (without data for simplicity):
# Column MultiIndex.
idx = pd.MultiIndex(levels=[['Col1', 'Col2', 'Col3'], ['subcol1', 'subcol2']],
codes=[[2, 1, 0], [0, 1, 1]])
df = pd.DataFrame(columns=range(len(idx)))
df.columns = idx
print(df)
Col3 Col2 Col1
subcol1 subcol2 subcol2
Clearly, the column MultiIndex
is not sorted. We can check it with:
print(df.columns.is_monotonic)
False
This matters because Pandas performs index lookup and other operations much faster if the index is sorted, because it can use operations that assume the sorted order and are faster. Indeed, if we try to drop a column:
df.drop('Col1', axis=1)
PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
df.drop('Col1', axis=1)
Instead, if we sort the index before dropping, the warning disappears:
print(df.sort_index(axis=1))
# Index is now sorted in lexical order.
Col1 Col2 Col3
subcol2 subcol2 subcol1
# No warning here.
df.sort_index(axis=1).drop('Col1', axis=1)
EDIT (see comments): As the warning suggests, this happens when we do not specify the level from which we want to drop the column. This is because to drop the column, pandas has to traverse the whole non-sorted index (happens here). By specifying it we do not need such traversal:
# Also no warning.
df.drop('Col1', axis=1, level=0)
However, in general this problem relates more on row indices, as usually column multi-indices are way smaller. But definitely to keep it in mind for larger indices and dataframes. In fact, this is in particular relevant for slicing by index and for lookups. In those cases, you want your index to be sorted for better performance.