Subsetting pandas dataframe results in apparently incorrect size based

Time:11-21

I am attempting to subset a pandas DataFrame df with a list L that contains only the names of the columns I am interested in. The shape of df is (207, 8440) and the length of L is 6894. When I subset the DataFrame as df[L] (or df.loc[:, L]), I get a bizarre result: the resulting DataFrame should have shape (207, 6894), but instead I get (207, 7092).

It seems that this should not even be possible. Can anyone explain this behavior?

CodePudding user response:

[moving from comment to answer]

A pandas dataframe can have multiple columns with the exact same name. If this happens, passing a list of column names can return more columns than the size of the list.
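To make this concrete, here is a minimal sketch using a made-up three-column frame (the data and the names a and b are purely illustrative, not taken from the question). Selecting two names returns three columns because one of the names occurs twice:

```python
import pandas as pd

# Hypothetical toy frame: the column name "a" appears twice.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])

L = ["a", "b"]   # two names requested
subset = df[L]   # label lookup returns every column matching each name

print(len(L))         # 2
print(subset.shape)   # (2, 3): both "a" columns plus "b"
```

By the same logic, the 198 extra columns in the question (7092 - 6894) would be explained if names in L match 198 additional columns in df.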

You can check whether the dataframe has duplicate column names using {col for col in df.columns if list(df.columns).count(col) > 1}. This returns a set of every column name that appears more than once.
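As an alternative sketch, not part of the original answer, pandas' Index.duplicated() gives a similar check without counting every name against the full list, which may matter with thousands of columns. The toy frame is again hypothetical, and whether dropping the repeated columns is appropriate depends on your data:

```python
import pandas as pd

# Hypothetical toy frame with one repeated column name.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])

# Boolean mask: True for every occurrence of a name after its first.
dup_mask = df.columns.duplicated()

# Names that occur more than once.
dup_names = set(df.columns[dup_mask])
print(dup_names)

# Optionally keep only the first occurrence of each column name.
deduped = df.loc[:, ~dup_mask]
print(deduped.shape)
```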
