I want to drop columns whose values are identical to those of another column. From DF, it should yield DF_new:
import numpy as np
import pandas as pd

DF = pd.DataFrame(index=[1,2,3,4], columns=['col1', 'col2', 'col3', 'col4', 'col5'])
x = np.random.uniform(size=4)
DF['col1'] = x
DF['col2'] = x + 2
DF['col3'] = x
DF['col4'] = x + 2
DF['col5'] = [5,6,7,8]
display(DF)
DF_new = DF[['col1', 'col2', 'col5']]
display(DF_new)
The code above is a simple example of what I can't manage to do.
Note that the column names are not the same, so I can't use:
DF_new = DF.loc[:,~DF.columns.duplicated()].copy()
which drops columns based on their names.
CodePudding user response:
You can use:
df = df.T.drop_duplicates().T
Step by step:
df2 = df.T  # T = transpose (rows become columns)
'''
            1         2         3         4
col1  0.67075  0.707864  0.206923  0.168023
col2  2.67075  2.707864  2.206923  2.168023
col3  0.67075  0.707864  0.206923  0.168023
col4  2.67075  2.707864  2.206923  2.168023
col5  5.00000  6.000000  7.000000  8.000000
'''
# now we can use drop_duplicates
df2 = df2.drop_duplicates()
'''
            1         2         3         4
col1  0.67075  0.707864  0.206923  0.168023
col2  2.67075  2.707864  2.206923  2.168023
col5  5.00000  6.000000  7.000000  8.000000
'''
# then transpose again
df2 = df2.T
'''
       col1      col2  col5
1  0.670750  2.670750   5.0
2  0.707864  2.707864   6.0
3  0.206923  2.206923   7.0
4  0.168023  2.168023   8.0
'''
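For anyone who wants to run the whole thing end to end, here is a minimal sketch of the transpose approach on the frame from the question. The fixed seed and the `deduped` name are my own additions, not part of the original post. One thing worth noticing in the output above is that col5 comes back as 5.0 instead of 5: transposing a mixed int/float frame upcasts everything to float, so the sketch also casts the surviving columns back to their original dtypes.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question (seeded so the run is repeatable).
rng = np.random.default_rng(0)
x = rng.uniform(size=4)
df = pd.DataFrame(
    {"col1": x, "col2": x + 2, "col3": x, "col4": x + 2, "col5": [5, 6, 7, 8]},
    index=[1, 2, 3, 4],
)

# Transpose so columns become rows, drop the repeated rows, transpose back.
deduped = df.T.drop_duplicates().T

# The round trip through .T turned every column into float,
# so restore the original dtypes of the columns that survived.
deduped = deduped.astype(df.dtypes[deduped.columns].to_dict())

print(deduped.columns.tolist())  # ['col1', 'col2', 'col5']
```

On a very wide or very long frame the double transpose copies all the data twice, so it may be slow there; for a small frame like this one it should be fine.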
CodePudding user response:
This should do what you need:
df = df.loc[:,~df.apply(lambda x: x.duplicated(),axis=1).all()].copy()
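One caveat, as far as I can tell: the one-liner above checks for duplicates row by row, so it can also drop a column that matches different columns in different rows without being identical to any single one. The sketch below shows such a case on a made-up three-column frame, next to a hypothetical helper (`drop_duplicate_columns` is my own name, not a pandas function) that compares whole columns with `Series.equals` instead.

```python
import pandas as pd

def drop_duplicate_columns(df):
    """Keep the first of each group of columns that are identical in every row."""
    keep = []
    for name in df.columns:
        # Series.equals ignores the column name but requires matching dtypes,
        # so an int column and a float column with equal values are both kept.
        if not any(df[name].equals(df[kept]) for kept in keep):
            keep.append(name)
    return df[keep].copy()

# 'c' equals 'a' in the first row and 'b' in the second row,
# but it is not a copy of either column.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [1, 4]})

row_wise = df.loc[:, ~df.apply(lambda row: row.duplicated(), axis=1).all()].copy()
whole_column = drop_duplicate_columns(df)

print(row_wise.columns.tolist())      # ['a', 'b']
print(whole_column.columns.tolist())  # ['a', 'b', 'c']
```

On the frame from the question both versions give the same result (col1, col2, col5), so the difference only matters when values can coincide across columns by chance.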