How to drop duplicate columns from a pandas DataFrame, based on the columns' values

I want to drop columns whose values are identical to those of another column. From DF, it should yield DF_new:

import numpy as np
import pandas as pd

DF = pd.DataFrame(index=[1,2,3,4], columns=['col1', 'col2', 'col3', 'col4', 'col5'])
x = np.random.uniform(size=4)
DF['col1'] = x
DF['col2'] = x + 2
DF['col3'] = x
DF['col4'] = x + 2
DF['col5'] = [5,6,7,8]
display(DF)  # display() is available in Jupyter/IPython

DF_new = DF[['col1', 'col2', 'col5']]
display(DF_new)

The above is a simple example of what I can't manage to do.

Note that the column names are not the same, so I can't use:

DF_new = DF.loc[:,~DF.columns.duplicated()].copy()

which drops columns based on their names.
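To illustrate the point, here is a minimal sketch (the two-column frame is made up for the illustration, not taken from the question's data). `columns.duplicated()` only inspects the labels, so two columns with identical values but different names are both kept:

```python
import pandas as pd

# Hypothetical frame: identical values, different column names.
df = pd.DataFrame({"col1": [1, 2], "col2": [1, 2]})

# Index.duplicated() compares the column labels only, never the values.
mask = df.columns.duplicated()
print(mask)  # [False False]

# Nothing is dropped, because no label is repeated.
unchanged = df.loc[:, ~mask].copy()
print(list(unchanged.columns))  # ['col1', 'col2']
```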

CodePudding user response：

You can use:

df = df.T.drop_duplicates().T

Step by step:

df2 = df.T # T = transpose (convert rows to columns)

            1         2         3         4
col1  0.67075  0.707864  0.206923  0.168023
col2  2.67075  2.707864  2.206923  2.168023
col3  0.67075  0.707864  0.206923  0.168023
col4  2.67075  2.707864  2.206923  2.168023
col5  5.00000  6.000000  7.000000  8.000000

# now we can use drop_duplicates
df2 = df2.drop_duplicates()
'''
            1         2         3         4
col1  0.67075  0.707864  0.206923  0.168023
col2  2.67075  2.707864  2.206923  2.168023
col5  5.00000  6.000000  7.000000  8.000000
'''

# then transpose again
df2 = df2.T
'''
       col1      col2  col5
1  0.670750  2.670750   5.0
2  0.707864  2.707864   6.0
3  0.206923  2.206923   7.0
4  0.168023  2.168023   8.0
'''
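For reference, the steps above can be put together as a self-contained sketch. It rebuilds the frame from the question (assuming `col2` and `col4` are `x + 2`, as the printed values suggest) and uses a seeded generator only to make the run repeatable. One thing to be aware of: transposing a frame with mixed int and float columns converts everything to a common dtype, which is why `col5` shows up as `5.0` in the output above. The last step, restoring the original dtypes, is an optional addition and not part of the one-liner.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame (x + 2 is assumed from the printed output).
rng = np.random.default_rng(0)
x = rng.uniform(size=4)
df = pd.DataFrame(
    {"col1": x, "col2": x + 2, "col3": x, "col4": x + 2, "col5": [5, 6, 7, 8]},
    index=[1, 2, 3, 4],
)

# Remember the dtypes before transposing.
original_dtypes = df.dtypes

# Transpose, drop duplicate rows (the former columns), transpose back.
deduped = df.T.drop_duplicates().T

# After the round trip every column is float64, so col5 is no longer an integer column.
# Optional: cast each surviving column back to its original dtype.
deduped = deduped.astype(original_dtypes[deduped.columns].to_dict())

print(list(deduped.columns))  # ['col1', 'col2', 'col5']
print(deduped.dtypes)
```

On wide or very long frames the double transpose copies the data twice, so it may be worth timing it on real data before relying on it.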

CodePudding user response：

This should do what you need:

df = df.loc[:,~df.apply(lambda x: x.duplicated(),axis=1).all()].copy()

as you can see from this link
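One caveat with the row-wise check above: it flags a column when, in every row, its value already appeared somewhere to the left in that row, even if the matches come from different columns. The sketch below uses a small made-up frame to show such a case, followed by a hypothetical helper, `drop_duplicate_columns` (my own name, not a pandas function), that compares whole columns with `Series.equals` instead. It assumes the column names are unique.

```python
import pandas as pd

# Made-up frame: "c" matches "a" in the first row and "b" in the second,
# but it is not a copy of either column.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [1, 4]})

# Row-wise approach: "c" is marked as duplicated in every row, so it is dropped.
row_wise = df.loc[:, ~df.apply(lambda x: x.duplicated(), axis=1).all()].copy()
print(list(row_wise.columns))  # ['a', 'b']


def drop_duplicate_columns(frame):
    """Keep the first of any group of columns whose values are identical.

    Hypothetical helper; assumes unique column names. Series.equals also
    requires matching dtypes, so an int column and a float column holding
    the same numbers are treated as different.
    """
    kept = []
    for name in frame.columns:
        if not any(frame[name].equals(frame[k]) for k in kept):
            kept.append(name)
    return frame[kept].copy()


column_wise = drop_duplicate_columns(df)
print(list(column_wise.columns))  # ['a', 'b', 'c']
```

The helper compares each column against the columns already kept, so the cost grows with the square of the number of distinct columns; for a handful of columns that is usually fine.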