How to drop duplicate data with different column names in pandas?

Time:09-25

I have a DataFrame in which some columns hold identical data under different names:

In[1]: df
Out[1]: 
  X1   X2  Y1   Y2
 0.0  0.0  6.0  6.0
 3.0  3.0  7.1  7.1
 7.6  7.6  1.2  1.2

I know .drop(columns=...) exists, but is there a more efficient way to drop these columns without having to list their names? If not, please let me know, since I can just use .drop().

CodePudding user response:

You could transpose with T and drop_duplicates then transpose back:

>>> df.T.drop_duplicates().T
    X1   Y1
0  0.0  6.0
1  3.0  7.1
2  7.6  1.2
>>> 
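For anyone who wants to run this, here is a minimal self-contained sketch of the transpose approach. The frame is rebuilt from the values in the question, and the name `deduped` is only for illustration.

```python
import pandas as pd

# Sample frame rebuilt from the question: X2 repeats X1, Y2 repeats Y1.
df = pd.DataFrame({
    "X1": [0.0, 3.0, 7.6],
    "X2": [0.0, 3.0, 7.6],
    "Y1": [6.0, 7.1, 1.2],
    "Y2": [6.0, 7.1, 1.2],
})

# Transpose so columns become rows, drop the duplicate rows, transpose back.
deduped = df.T.drop_duplicates().T
print(list(deduped.columns))  # ['X1', 'Y1']
```

One thing to watch: if the columns have mixed dtypes, the transpose goes through object dtype, so the dtypes of the result may need to be restored afterwards.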

Or with loc and duplicated:

>>> df.loc[:, ~df.T.duplicated()]
    X1   Y1
0  0.0  6.0
1  3.0  7.1
2  7.6  1.2
>>> 
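A mask-based selection is worth checking on a slightly larger frame, since the question's data only has exact pairs. The sketch below negates `duplicated()` so that the first copy of each column is kept. The extra columns `X3` and `Z` are made up for illustration and are not in the original question.

```python
import pandas as pd

df = pd.DataFrame({
    "X1": [0.0, 3.0, 7.6],
    "X2": [0.0, 3.0, 7.6],
    "X3": [0.0, 3.0, 7.6],  # hypothetical third copy of X1
    "Y1": [6.0, 7.1, 1.2],
    "Y2": [6.0, 7.1, 1.2],
    "Z": [1.0, 2.0, 3.0],   # hypothetical column with no duplicate
})

# duplicated() flags every repeat after the first occurrence,
# so negating it keeps first occurrences and columns that are unique.
is_repeat = df.T.duplicated()
result = df.loc[:, ~is_repeat]
print(list(result.columns))  # ['X1', 'Y1', 'Z']
```

With this form, a column that has no twin stays in the result, and only one of several identical columns survives.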

CodePudding user response:

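pandas has no built-in function for dropping duplicate columns, but we can use np.unique over axis 1.

df.drop_duplicates only removes duplicate rows. From its documentation:

"Return DataFrame with duplicate rows removed."

We can wrap np.unique in a function to drop duplicate columns. Note that np.unique returns the unique columns in sorted order, so the column order of the result may differ from the original: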

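import numpy as np
import pandas as pd

def drop_duplicate_cols(df):
    # idxs[i] is the position in df of the first column equal to uniq[:, i]
    uniq, idxs = np.unique(df, return_index=True, axis=1)
    return pd.DataFrame(uniq, index=df.index, columns=df.columns[idxs])

drop_duplicate_cols(df)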
    X1   Y1
0  0.0  6.0
1  3.0  7.1
2  7.6  1.2
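
If the original column order matters, one possible variation is to use only the indices returned by np.unique and sort them before selecting. This is a sketch, assuming all columns are numeric (np.unique with axis does not handle object dtype); the helper name `drop_duplicate_cols_keep_order` and the sample frame are made up for illustration.

```python
import numpy as np
import pandas as pd

def drop_duplicate_cols_keep_order(df):
    # Hypothetical helper: keep the first occurrence of each distinct column,
    # in the order the columns appear in df.
    _, idxs = np.unique(df.to_numpy(), return_index=True, axis=1)
    return df.iloc[:, np.sort(idxs)]

# Columns deliberately not in sorted order, to show that order is preserved.
df = pd.DataFrame({
    "B": [6.0, 7.1, 1.2],
    "A": [0.0, 3.0, 7.6],
    "B2": [6.0, 7.1, 1.2],
})

result = drop_duplicate_cols_keep_order(df)
print(list(result.columns))  # ['B', 'A']
```

Selecting with iloc also keeps the original dtypes, since the data is taken from df rather than rebuilt from the NumPy array.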