import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 1, 2, 3, 4, 1, 2, 3],
                   'B': [5, 6, 7, 5, 6, 7, 8, 5, 6, 7],
                   'C': ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z', 'z']})
I'm trying to get rid of duplicate groups of rows when grouping by the variable C. In my example, only groups x and z are duplicates of each other; y has an extra row that makes it different from x.
So the output should be:
   A  B  C
0  1  5  x
1  2  6  x
2  3  7  x
3  1  5  y
4  2  6  y
5  3  7  y
6  4  8  y
The methods I tried (.duplicated, .drop_duplicates) work well for comparing single rows, but not groups of rows.
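For example, row-wise deduplication on A and B breaks up group y instead of dropping the duplicate group z, which illustrates the mismatch:
df.drop_duplicates(subset=['A', 'B'])
#    A  B  C
# 0  1  5  x
# 1  2  6  x
# 2  3  7  x
# 6  4  8  y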
CodePudding user response:
In your case we can groupby with tuple first:
# aggregate each group to tuples, drop duplicate groups, keep the surviving keys
out = df.loc[df['C'].isin(df.groupby('C').agg(tuple).drop_duplicates().index)]
Output:
   A  B  C
0  1  5  x
1  2  6  x
2  3  7  x
3  1  5  y
4  2  6  y
5  3  7  y
6  4  8  y
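To see why this works: the tuple aggregation collapses each group to a single row, so duplicate groups become duplicate rows that drop_duplicates can remove:
df.groupby('C').agg(tuple)
#               A             B
# C
# x     (1, 2, 3)     (5, 6, 7)
# y  (1, 2, 3, 4)  (5, 6, 7, 8)
# z     (1, 2, 3)     (5, 6, 7)
# drop_duplicates() then removes z (same tuples as x),
# leaving index ['x', 'y'] to filter the original frame with isin.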
CodePudding user response:
Another approach, using pandas.util.hash_pandas_object to compute a hash of each group:
# hash each row of each group (excluding the key column and the index),
# then collect a group's row hashes into a tuple so that identical
# groups produce identical tuples
idx = (df.drop(columns='C')
         .groupby(df['C'])
         .apply(lambda g: pd.util.hash_pandas_object(g, index=False).agg(tuple))
         .drop_duplicates()
         .index
       )
out = df[df['C'].isin(idx)]
Output:
   A  B  C
0  1  5  x
1  2  6  x
2  3  7  x
3  1  5  y
4  2  6  y
5  3  7  y
6  4  8  y
It is more verbose than @BENY's nice and short code, but significantly more efficient with many columns.
Timing on 400 columns:
# dataframe hash
54.5 ms ± 3.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# tuple aggregation
255 ms ± 22.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
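The original benchmark setup isn't shown; below is a minimal sketch of one that could produce a comparison like this. The 400-column random frame, the group sizes, and the seed are all assumptions, not the setup actually used:
import numpy as np
import pandas as pd

# hypothetical data: 400 numeric columns, three groups of 100 rows,
# built so that all groups are exact duplicates of each other
rng = np.random.default_rng(0)
block = pd.DataFrame(rng.integers(0, 10, size=(100, 400)))
big = pd.concat([block, block, block], ignore_index=True)
big['C'] = np.repeat(['x', 'y', 'z'], 100)

# dataframe hash
%timeit big.drop(columns='C').groupby(big['C']).apply(lambda g: pd.util.hash_pandas_object(g, index=False).agg(tuple)).drop_duplicates().index

# tuple aggregation
%timeit big.groupby('C').agg(tuple).drop_duplicates().index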