I am attempting to subset a pandas DataFrame df with a list L that contains only the column names in the DataFrame that I am interested in. The shape of df is (207, 8440) and the length of L is 6894. When I subset my DataFrame as df[L] (or df.loc[:, L]), I get a bizarre result: the resulting DataFrame should have shape (207, 6894), but instead I get (207, 7092).
It seems that this should not even be possible. Can anyone explain this behavior?
CodePudding user response:
[moving from comment to answer]
A pandas dataframe can have multiple columns with the exact same name. If this happens, passing a list of column names can return more columns than the size of the list.
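To illustrate, here is a minimal sketch on a toy frame (the data and the names a and b are made up for the example, not taken from the question). Selecting two names returns three columns because one of the names appears twice:

```python
import pandas as pd

# Toy frame in which the column name "a" appears twice (hypothetical data).
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])

L = ["a", "b"]      # two distinct column names
subset = df[L]      # every column labelled "a" is selected

print(len(L))        # 2
print(subset.shape)  # (2, 3)
```

In the question, the 198 extra columns (7092 - 6894) would be consistent with some of the names in L matching more than one column in df.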
You can check whether the DataFrame has duplicate column names using {col for col in df.columns if list(df.columns).count(col) > 1}
This returns the set of every column name that occurs more than once. An empty set means all column names are unique.
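As an alternative sketch, the same check can be done with Index.duplicated(), which avoids rescanning the column list for every name. The toy frame and the choice to keep only the first occurrence of each name are assumptions for illustration; whether dropping the later copies is appropriate depends on your data.

```python
import pandas as pd

# Toy frame in which the column name "a" appears twice (hypothetical data).
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=["a", "b", "a", "c"])

# Boolean mask that is True for the second and later occurrences of a name.
dup_mask = df.columns.duplicated()
dup_names = set(df.columns[dup_mask])
print(dup_names)  # {'a'}

# Keep only the first occurrence of each column name.
deduped = df.loc[:, ~dup_mask]

# With unique names, a list of k names selects exactly k columns.
print(deduped[["a", "b"]].shape)  # (2, 2)
```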