I'm working with a large dataset (921600 rows, 23 columns) that contains the occasional duplicate column (under a different column name, however). I would like to remove the columns with identical values, but 'df.T.drop_duplicates().T' and similar solutions simply take too long, presumably because they check all 921600 rows. Is it possible to remove duplicate columns by comparing just the first x rows?
E.g.: identify that 'channel2' and 'channel2-b' are duplicates by comparing the first x (say 10) rows instead of inspecting the full million rows.
        channel1  channel2  channel3  channel2-b
0             47        46        27          46
1             84        28        28          28
2             72        79        68          79
...          ...       ...       ...         ...
999997      4729      1957      2986        1957
999998      9918      1513      2957        1513
999999      1001      5883      7577        5883
Answer:
Use DataFrame.duplicated on the transposed top N rows selected with DataFrame.head, then keep only the non-duplicated columns with DataFrame.loc:
# compare only the first N rows; transposing lets duplicated() flag repeated columns
N = 2
df = df.loc[:, ~df.head(N).T.duplicated()]
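As a sanity check, here is a minimal, self-contained sketch of the same idea; the sample values are made up for illustration:

import pandas as pd

# Hypothetical sample data: 'channel2-b' duplicates 'channel2'
df = pd.DataFrame({
    "channel1": [47, 84, 72],
    "channel2": [46, 28, 79],
    "channel3": [27, 28, 68],
    "channel2-b": [46, 28, 79],
})

N = 2  # number of leading rows to compare
df = df.loc[:, ~df.head(N).T.duplicated()]
print(df.columns.tolist())  # ['channel1', 'channel2', 'channel3']

One caveat: comparing only the first N rows can flag columns that merely coincide early but diverge later. If that matters, confirm each flagged pair on the full data with something like df['channel2'].equals(df['channel2-b']); this stays cheap because only the few candidate pairs need a full scan.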