How to drop identical columns in Pandas dataframe if first x rows of values are identical?

Time:10-24

I'm working with a large dataset (921600 rows, 23 columns) that occasionally contains duplicate columns under different column names. I would like to remove the columns with identical values. However, 'df.T.drop_duplicates().T' and similar solutions take too long, presumably because they compare all 921600 rows. Is it possible to remove columns when just the first x rows have identical values?

E.g.: Identify that 'channel2' and 'channel2-b' are duplicates by comparing only the first x (say 10) rows instead of inspecting every row.

           channel1 channel2 channel3 channel2-b
0                47       46       27         46
1                84       28       28         28
2                72       79       68         79
...             ...      ...      ...        ...
999997         4729     1957     2986       1957
999998         9918     1513     2957       1513
999999         1001     5883     7577       5883

CodePudding user response:

Take the first N rows with DataFrame.head, transpose them so that each column becomes a row, flag the repeats with DataFrame.duplicated, and select the remaining columns with DataFrame.loc:

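# number of leading rows to compare (the question suggests 10)
N = 2
# keep only the columns whose first N values do not repeat an earlier column
df = df.loc[:, ~df.head(N).T.duplicated()]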
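
Comparing only the first N rows can also flag two columns that merely start with the same values and diverge later. The sketch below is one possible way to guard against that, not part of the answer above: it uses the same head-based check as a cheap screen and then confirms each flagged column against the columns already kept with a full Series.equals comparison. The function name drop_duplicate_columns and its parameter n are hypothetical, and the sample values are taken from the table in the question.

```python
import pandas as pd


def drop_duplicate_columns(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Drop columns that duplicate an earlier column (hypothetical helper).

    The first n rows are used as a cheap screen; only flagged columns
    are compared in full.
    """
    # Screen: True where a column's first n values repeat an earlier column.
    candidate = df.head(n).T.duplicated().to_numpy()

    kept = []  # positions of the columns to keep
    for pos in range(df.shape[1]):
        col = df.iloc[:, pos]
        # Confirm a flagged column against the columns kept so far.
        # Series.equals requires the same dtype, so columns that differ
        # only in dtype are kept (a conservative choice).
        if candidate[pos] and any(col.equals(df.iloc[:, k]) for k in kept):
            continue
        kept.append(pos)
    return df.iloc[:, kept]


# Small frame built from the values shown in the question.
df = pd.DataFrame({
    "channel1": [47, 84, 72, 4729],
    "channel2": [46, 28, 79, 1957],
    "channel3": [27, 28, 68, 2986],
    "channel2-b": [46, 28, 79, 1957],
})
result = drop_duplicate_columns(df, n=2)
print(list(result.columns))
```

If the duplicates are known to be exact copies, the one-liner above is enough and the confirmation loop can be skipped; the extra pass only matters when a short prefix might match by coincidence.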