Home > Software design >  Pandas - find other values in duplicates
Pandas - find other values in duplicates

Time:04-07

Say I have the following dataframe

Name  Value
foo1  abc
foo2  abc
foo3  efg
foo4  xyz
foo5  abc

There are duplicates. 'foo1', 'foo2', and 'foo3' all have the same value.

How do I get a list of the duplicates? What I'm looking for is this

Name  Value   Dupes
foo1  abc     foo1, foo2, foo5
foo2  abc     foo1, foo2, foo5
foo3  efg     foo3
foo4  xyz     foo4
foo5  abc     foo1, foo2, foo5

or

Name  Value   Dupes
foo1  abc     foo2, foo5
foo2  abc     foo1, foo5
foo3  efg     
foo4  xyz     
foo5  abc     foo1, foo2

CodePudding user response:

You could use groupby.agg, then map:

df['Dupes'] = df['Value'].map(df.groupby('Value')['Name'].agg(', '.join))

Output:

   Name Value             Dupes
0  foo1   abc  foo1, foo2, foo5
1  foo2   abc  foo1, foo2, foo5
2  foo3   efg              foo3
3  foo4   xyz              foo4
4  foo5   abc  foo1, foo2, foo5

CodePudding user response:

We do transform with groupby

df['new'] = df.groupby('Value')['Name'].transform(','.join)
Out[112]: 
0    foo1,foo2,foo5
1    foo1,foo2,foo5
2              foo3
3              foo4
4    foo1,foo2,foo5
Name: Name, dtype: object

CodePudding user response:

The second example given differs from the first output example, due to non duplicated values having no names in the Dupes column. Here is a way to handle that, as well as ensuring only unique names are displayed in the Dupes column.

(df.assign(Dupes = df['Value'].map((df.groupby('Value')['Name']
                                    .agg(set)
                                    .where(lambda x: x.str.len().gt(1))
                                    .map(','.join,na_action='ignore')))))

Here is an alternative way:

(df.assign(Dups = df.loc[df.duplicated(subset=['Value'],keep=False)]
           .groupby('Value')['Name']
           .transform(lambda x: ','.join(set(x)))))

Output:

   Name Value            Dups
0  foo1   abc  foo1,foo2,foo5
1  foo2   abc  foo1,foo2,foo5
2  foo3   efg             NaN
3  foo4   xyz             NaN
4  foo5   abc  foo1,foo2,foo5

Just noticed that the Dups column has only the names in the other rows. Try this:

(df.assign(Dups = df['Value'].map(df.groupby('Value')['Name'].agg(set))
           .sub(df['Name'].map(lambda x: set(x.split())))
           .map(','.join)))

Output:

   Name Value       Dups
0  foo1   abc  foo5,foo2
1  foo2   abc  foo5,foo1
2  foo3   efg           
3  foo4   xyz           
4  foo5   abc  foo1,foo2
  • Related