exclude duplicate groups of rows given grouping variable


import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 1, 2, 3, 4, 1, 2, 3],
                   'B': [5, 6, 7, 5, 6, 7, 8, 5, 6, 7],
                   'C': ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z', 'z']})

I'm trying to get rid of duplicate groups of rows when considering grouping variable C.

In my example, only x and z are duplicates, because y has an extra row that makes it different from x.

So the output should be:

   A  B  C
0  1  5  x
1  2  6  x
2  3  7  x
3  1  5  y
4  2  6  y
5  3  7  y
6  4  8  y

The methods I tried (.duplicated, .drop_duplicates) work well when comparing single rows, but not groups of rows.
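For example, row-level deduplication either does nothing here or drops the wrong rows:

# full-row check: nothing is dropped, because C makes every row unique
df.drop_duplicates()                   # all 10 rows survive

# ignoring C drops repeated rows across *all* groups rather than
# whole duplicate groups — y loses rows it should keep
df.drop_duplicates(subset=['A', 'B'])  # keeps only rows 0, 1, 2, 6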

CodePudding user response:

In your case we can group by C and aggregate each group into tuples first, then drop the groups whose tuples are duplicated:

out = df.loc[df['C'].isin(df.groupby('C').agg(tuple).drop_duplicates().index)]

output:
   A  B  C
0  1  5  x
1  2  6  x
2  3  7  x
3  1  5  y
4  2  6  y
5  3  7  y
6  4  8  y
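The intermediate aggregation makes the comparison concrete: each group collapses to a single row of tuples, so duplicate groups become duplicate rows (output shown as comments):

df.groupby('C').agg(tuple)
#              A             B
# C
# x     (1, 2, 3)     (5, 6, 7)
# y  (1, 2, 3, 4)  (5, 6, 7, 8)
# z     (1, 2, 3)     (5, 6, 7)

x and z collapse to identical rows, so drop_duplicates() removes z, and the surviving index (x and y) selects the rows to keep.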

CodePudding user response:

Another approach: use pandas.util.hash_pandas_object to compute a hash of each group.

idx = (df.drop(columns='C')        # hash only the data columns
         .groupby(df['C'])         # group by the original C
         .apply(lambda g: pd.util.hash_pandas_object(g, index=False).agg(tuple))
         .drop_duplicates().index  # keep the first occurrence of each hash tuple
        )

out = df[df['C'].isin(idx)]

output:

   A  B  C
0  1  5  x
1  2  6  x
2  3  7  x
3  1  5  y
4  2  6  y
5  3  7  y
6  4  8  y
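Each group is reduced to a tuple of per-row uint64 hashes, so two groups match exactly when their hash tuples match. With the default first-occurrence behavior of drop_duplicates, idx here holds the labels of the groups to keep:

print(idx)
# Index(['x', 'y'], dtype='object', name='C')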

It is more verbose than @BENY's nice and short code, but significantly more efficient when there are many columns.

Timing on 400 columns:

# dataframe hash
54.5 ms ± 3.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# tuple aggregation
255 ms ± 22.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
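The exact benchmark setup is not shown; a plausible reproduction (tiling the toy frame's A/B columns out to 400 and timing with IPython's %timeit are assumptions, not part of the answer) would be:

import pandas as pd

# hypothetical wide frame: 400 columns, same three groups as the toy example
wide = pd.concat([df.drop(columns='C')] * 200, axis=1, ignore_index=True)
wide['C'] = df['C']

# dataframe hash
%timeit wide.drop(columns='C').groupby(wide['C']).apply(lambda g: pd.util.hash_pandas_object(g, index=False).agg(tuple)).drop_duplicates()

# tuple aggregation
%timeit wide.groupby('C').agg(tuple).drop_duplicates()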