Using a pandas.DataFrame, how should I remove duplicate rows (duplicates based on multiple columns), keeping the rows that match the mode of another column?
import pandas as pd

df = pd.DataFrame(
    data={
        "col_1": [0, 0, 0, 0, 1, 1, 1, 1],
        "col_2": [1, 1, 1, 1, 2, 2, 2, 2],
        "col_3": [5, 5, 0, 1, 8, 8, 0, 1],
        "another_column": [0, 0, 0, 0, 0, 0, 0, 0],
    }
)
# The following line computes the correct mode per group, but it returns a Series
# rather than the original DataFrame reduced to the two unique rows
print(df.groupby(by=["col_1", "col_2"])["col_3"].agg(lambda x: x.mode()[0]))
CodePudding user response:
Use GroupBy.transform to get each group's mode as a Series aligned with the original rows, then compare it with the original column col_3 in boolean indexing:
s = df.groupby(by=["col_1", "col_2"])["col_3"].transform(lambda x: x.mode()[0])
df1 = df[df['col_3'].eq(s)]
print(df1)
   col_1  col_2  col_3  another_column
0      0      1      5               0
1      0      1      5               0
4      1      2      8               0
5      1      2      8               0
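To see why transform works here where agg does not, it may help to compare the shapes of the two results. The sketch below reuses the question's data; the names keys, per_group, per_row and mask are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    data={
        "col_1": [0, 0, 0, 0, 1, 1, 1, 1],
        "col_2": [1, 1, 1, 1, 2, 2, 2, 2],
        "col_3": [5, 5, 0, 1, 8, 8, 0, 1],
        "another_column": [0, 0, 0, 0, 0, 0, 0, 0],
    }
)
keys = ["col_1", "col_2"]  # illustrative name for the grouping columns

# agg collapses each group to a single value: one entry per group
per_group = df.groupby(keys)["col_3"].agg(lambda x: x.mode()[0])

# transform broadcasts the group's value back to every row of that group,
# so the result shares the index of df and can be compared row by row
per_row = df.groupby(keys)["col_3"].transform(lambda x: x.mode()[0])

print(per_group.shape)  # (2,)
print(per_row.shape)    # (8,)

# boolean mask: True where the row's col_3 equals its group's mode
mask = df["col_3"].eq(per_row)
print(mask.tolist())
# [True, True, False, False, True, True, False, False]
```

Because per_row has the same index as df, the mask can be passed straight to df[...].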
If you need only the first row per group:
s = df.groupby(by=["col_1", "col_2"])["col_3"].transform(lambda x: x.mode()[0])
df1 = df[df['col_3'].eq(s)].drop_duplicates(["col_1", "col_2"])
print(df1)
   col_1  col_2  col_3  another_column
0      0      1      5               0
4      1      2      8               0
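If this is needed in more than one place, the two steps could be wrapped in a small helper. This is only a sketch: keep_mode_row and its parameter names are hypothetical, not part of pandas. It also assumes every group has at least one non-missing value, since mode() drops NaN by default and x.mode()[0] would fail on an empty result.

```python
import pandas as pd


def keep_mode_row(df, keys, value_col):
    """Return one row per group of `keys`: the first row whose
    `value_col` equals the mode of `value_col` in that group.

    Hypothetical helper, not a pandas API.
    """
    # mode of value_col per group, broadcast back to every row
    mode_per_row = df.groupby(keys)[value_col].transform(lambda x: x.mode()[0])
    # keep rows matching their group's mode, then the first such row per group
    return df[df[value_col].eq(mode_per_row)].drop_duplicates(keys)


df = pd.DataFrame(
    data={
        "col_1": [0, 0, 0, 0, 1, 1, 1, 1],
        "col_2": [1, 1, 1, 1, 2, 2, 2, 2],
        "col_3": [5, 5, 0, 1, 8, 8, 0, 1],
        "another_column": [0, 0, 0, 0, 0, 0, 0, 0],
    }
)

result = keep_mode_row(df, ["col_1", "col_2"], "col_3")
print(result)
```

Note that when a group has several modes, x.mode()[0] takes the first value that mode() returns, so ties are resolved by that ordering rather than by row position.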