Home > Enterprise >  Pandas : remove duplicated rows using mode
Pandas : remove duplicated rows using mode

Time:04-20

Using a pandas.Dataframe, how should I remove duplicated (based on multiple columns) rows using the mode of another column ?

import pandas as pd

df = pd.DataFrame(
    data={
        "col_1": [0, 0, 0, 0, 1, 1, 1, 1],
        "col_2": [1, 1, 1, 1, 2, 2, 2, 2],
        "col_3": [5, 5, 0, 1, 8, 8, 0, 1],
        "another_column": [0, 0, 0, 0, 0, 0, 0, 0],
    }
)

# the following line shows the correct answer but doesn't return original dataframe
# with only the two unique rows
print(df.groupby(by=["col_1", "col_2"])["col_3"].agg(lambda x: x.mode()[0]))

CodePudding user response:

Use GroupBy.transform and compare original column col_3 in boolean indexing:

s = df.groupby(by=["col_1", "col_2"])["col_3"].transform(lambda x: x.mode()[0])
df1 = df[df['col_3'].eq(s)]
print (df1)
   col_1  col_2  col_3  another_column
0      0      1      5               0
1      0      1      5               0
4      1      2      8               0
5      1      2      8               0

If need first row per groups:

s = df.groupby(by=["col_1", "col_2"])["col_3"].transform(lambda x: x.mode()[0])
df1 = df[df['col_3'].eq(s)].drop_duplicates(["col_1", "col_2"])
print (df1)
   col_1  col_2  col_3  another_column
0      0      1      5               0
4      1      2      8               0
  • Related