I want to remove the duplicated values in my pandas
DataFrame.
My DataFrame looks like this:
# col_1 col_2 col_3 col_4
1 a a 1 1 # unwanted
2 a b 0.7 0.5
3 a c 0.5 0.3
4 b a 0.7 0.5 # Duplicated
5 b b 1 1 # unwanted
6 b c 0.8 0.6
7 c a 0.5 0.3 # Duplicated
8 c b 0.8 0.6 # Duplicated
9 c c 1 1 # unwanted
How can we clean up this DataFrame and remove the unwanted and duplicated rows?
You might think that this DataFrame is like a square matrix and that we could use np.tril,
but that does not work here,
because we need to calculate the rank of col_3
and col_4.
CodePudding user response:
In your case, sort col_1 and col_2 within each row using np.sort,
then call drop_duplicates and filter out the rows where col_1 equals col_2:
df[['col_1','col_2']] = np.sort(df[['col_1','col_2']].values,axis=1)
out = df.drop_duplicates(['col_1','col_2']).query('col_1!=col_2')
out
Out[118]:
col_1 col_2 col_3 col_4
2 a b 0.7 0.5
3 a c 0.5 0.3
6 b c 0.8 0.6
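For anyone who wants to reproduce this end to end, here is a minimal sketch. It assumes the DataFrame from the question is rebuilt by hand with an index running from 1 to 9; the construction of df is my assumption, not something shown in the question.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question (assumed construction).
df = pd.DataFrame(
    {
        "col_1": list("aaabbbccc"),
        "col_2": list("abcabcabc"),
        "col_3": [1, 0.7, 0.5, 0.7, 1, 0.8, 0.5, 0.8, 1],
        "col_4": [1, 0.5, 0.3, 0.5, 1, 0.6, 0.3, 0.6, 1],
    },
    index=range(1, 10),
)

# Sort each (col_1, col_2) pair so that ('b', 'a') becomes ('a', 'b').
df[["col_1", "col_2"]] = np.sort(df[["col_1", "col_2"]].to_numpy(), axis=1)

# Keep the first row of each pair, then drop the self-pairs such as ('a', 'a').
out = df.drop_duplicates(["col_1", "col_2"]).query("col_1 != col_2")
print(out)
```

Keep in mind that this overwrites col_1 and col_2 in df with the sorted pairs, so work on a copy if the original order of the two columns matters.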
CodePudding user response:
df.drop_duplicates(subset=['col_1', 'col_2'],
                   keep=False, inplace=True)
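One thing to watch for: drop_duplicates on col_1 and col_2 compares the pair in order, so (a, b) and (b, a) are not treated as duplicates of each other, and keep=False removes every copy of a duplicated row rather than keeping one. If the reversed pairs need to be caught without modifying df, one option is a boolean mask built from a sorted copy of the two columns. This is only a sketch; the names key, mask and result are mine, and df is rebuilt here on the assumption that it matches the question.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question (assumed construction).
df = pd.DataFrame(
    {
        "col_1": list("aaabbbccc"),
        "col_2": list("abcabcabc"),
        "col_3": [1, 0.7, 0.5, 0.7, 1, 0.8, 0.5, 0.8, 1],
        "col_4": [1, 0.5, 0.3, 0.5, 1, 0.6, 0.3, 0.6, 1],
    },
    index=range(1, 10),
)

# Order-independent key: each pair sorted within its row, kept in a separate frame.
key = pd.DataFrame(
    np.sort(df[["col_1", "col_2"]].to_numpy(), axis=1),
    index=df.index,
)

# True for the first occurrence of each unordered pair, excluding self-pairs.
mask = ~key.duplicated() & (df["col_1"] != df["col_2"])

# df itself is left unchanged; result holds the filtered rows.
result = df[mask]
print(result)
```

Because nothing is assigned back to df, the original nine rows stay available for other calculations such as ranking col_3 and col_4.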