Python Pandas Duplicate Groups all Rows-CodePudding

I have a large dataframe with 30 columns and I want to group duplicated rows together and create a new column with their group_id.


df = pd.DataFrame(

[('a', 'b', 'c', 'd'),
('c', 'e', 'g', 'b'),
('a', 'b', 'c', 'd'),
('c', 'e', 'g', 'b')],

columns = ['P1', 'P2', 'P3', 'P4'] )


df = pd.DataFrame(

[('a', 'b', 'c', 'd', 0),
('c', 'e', 'g', 'b', 1),
('a', 'b', 'c', 'd', 0),
('c', 'e', 'g', 'b', 1)],

columns = ['P1', 'P2', 'P3', 'P4', 'DuplicateGroup'] )

My attempt is to group the entire dataframe but I get an empty series:


df.groupby(df.columns.tolist()).size()

>>

Series([], dtype: int64)

Please note that it has quite some columns so I prefer to not type them manually.

CodePudding user response：

Rather than a groupby, you can drop the duplicates of the first frame and merge it back with your original DF, using the resulting index as your "group number"

pd.merge(left=df,
         right=df.drop_duplicates(df.columns.tolist(), keep='first').reset_index(),
         on=df.columns.tolist(),
         how='left',sort=False).rename({'index':'group'}, axis=1)

    P1  P2  P3  P4  group
0   a   b   c   d   0
1   c   e   g   b   1
2   a   b   c   d   0
3   c   e   g   b   1

CodePudding user response：

A simple groupby.ngroup should do the trick:

df['DuplicateGroup'] = df.groupby(list(df)).ngroup()

Output:

  P1 P2 P3 P4  DuplicateGroup
0  a  b  c  d               0
1  c  e  g  b               1
2  a  b  c  d               0
3  c  e  g  b               1