I have a DataFrame in which some columns contain duplicate data under different names:
In[1]: df
Out[1]:
X1 X2 Y1 Y2
0.0 0.0 6.0 6.0
3.0 3.0 7.1 7.1
7.6 7.6 1.2 1.2
I know .drop(columns=...) exists, but is there a more efficient way to drop these without having to list the column names? If not, please let me know and I'll just use .drop().
CodePudding user response:
You could transpose with T, call drop_duplicates, then transpose back:
>>> df.T.drop_duplicates().T
X1 Y1
0 0.0 6.0
1 3.0 7.1
2 7.6 1.2
>>>
Or with loc and duplicated:
>>> df.loc[:, ~df.T.duplicated()]
X1 Y1
0 0.0 6.0
1 3.0 7.1
2 7.6 1.2
>>>
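A self-contained sketch for checking this on a frame that also has a column with no duplicate. The column Z is hypothetical and not part of the question. It keeps every column that is not flagged as a repeat of an earlier column, so unique columns survive as well:

```python
import pandas as pd

# Sample data from the question, plus a hypothetical column Z with no duplicate.
df = pd.DataFrame({
    "X1": [0.0, 3.0, 7.6],
    "X2": [0.0, 3.0, 7.6],
    "Y1": [6.0, 7.1, 1.2],
    "Y2": [6.0, 7.1, 1.2],
    "Z": [1.0, 2.0, 3.0],
})

# duplicated() on the transpose flags each column that repeats an earlier one.
is_repeat = df.T.duplicated()

# Keep the columns that are not repeats: first occurrences and unique columns.
deduped = df.loc[:, ~is_repeat]
print(list(deduped.columns))
```

Note that transposing a frame with mixed dtypes produces object columns, so this approach is probably best suited to frames whose columns share one dtype.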
CodePudding user response:
We can use np.unique over axis 1. Unfortunately, there's no pandas built-in function to drop duplicate columns. df.drop_duplicates only removes duplicate rows; from its documentation:
"Return DataFrame with duplicate rows removed."
We can create a function around np.unique to drop duplicate columns.
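import numpy as np
import pandas as pd

def drop_duplicate_cols(df):
    # Unique columns (sorted by value) and the position of each one's first occurrence
    uniq, idxs = np.unique(df, return_index=True, axis=1)
    return pd.DataFrame(uniq, index=df.index, columns=df.columns[idxs])
drop_duplicate_cols(df)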
X1 Y1
0 0.0 6.0
1 3.0 7.1
2 7.6 1.2
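One thing to be aware of: np.unique returns the unique columns in sorted order, not in the order they appear in the frame. Here is a minimal sketch of a variant that keeps the original column order, assuming all columns are numeric. The function name drop_duplicate_cols_ordered and the reordered sample frame are made up for this example:

```python
import numpy as np
import pandas as pd

def drop_duplicate_cols_ordered(df):
    # return_index gives the position of the first occurrence of each unique column.
    _, idxs = np.unique(df.to_numpy(), return_index=True, axis=1)
    # Sorting those positions restores the original left-to-right column order.
    return df.iloc[:, np.sort(idxs)]

# Same data as the question, but with the Y columns placed first.
df = pd.DataFrame({
    "Y1": [6.0, 7.1, 1.2],
    "Y2": [6.0, 7.1, 1.2],
    "X1": [0.0, 3.0, 7.6],
    "X2": [0.0, 3.0, 7.6],
})
result = drop_duplicate_cols_ordered(df)
print(list(result.columns))
```

Selecting with iloc instead of rebuilding the frame from the NumPy array also keeps the original column dtypes.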