I have a large dataframe with 30 columns and I want to group duplicated rows together and create a new column with their group_id.
df = pd.DataFrame(
[('a', 'b', 'c', 'd'),
('c', 'e', 'g', 'b'),
('a', 'b', 'c', 'd'),
('c', 'e', 'g', 'b')],
columns = ['P1', 'P2', 'P3', 'P4'] )
df = pd.DataFrame(
[('a', 'b', 'c', 'd', 0),
('c', 'e', 'g', 'b', 1),
('a', 'b', 'c', 'd', 0),
('c', 'e', 'g', 'b', 1)],
columns = ['P1', 'P2', 'P3', 'P4', 'DuplicateGroup'] )
My attempt is to group the entire dataframe but I get an empty series:
df.groupby(df.columns.tolist()).size()
>>
Series([], dtype: int64)
Please note that it has quite some columns so I prefer to not type them manually.
CodePudding user response:
Rather than a groupby, you can drop the duplicates of the first frame and merge it back with your original DF, using the resulting index as your "group number"
pd.merge(left=df,
right=df.drop_duplicates(df.columns.tolist(), keep='first').reset_index(),
on=df.columns.tolist(),
how='left',sort=False).rename({'index':'group'}, axis=1)
P1 P2 P3 P4 group
0 a b c d 0
1 c e g b 1
2 a b c d 0
3 c e g b 1
CodePudding user response:
A simple groupby.ngroup
should do the trick:
df['DuplicateGroup'] = df.groupby(list(df)).ngroup()
Output:
P1 P2 P3 P4 DuplicateGroup
0 a b c d 0
1 c e g b 1
2 a b c d 0
3 c e g b 1