Say I have the following dataframe
Name Value
foo1 abc
foo2 abc
foo3 efg
foo4 xyz
foo5 abc
There are duplicates: 'foo1', 'foo2', and 'foo5' all have the same value.
How do I get a list of the duplicates? What I'm looking for is this:
Name Value Dupes
foo1 abc foo1, foo2, foo5
foo2 abc foo1, foo2, foo5
foo3 efg foo3
foo4 xyz foo4
foo5 abc foo1, foo2, foo5
or
Name Value Dupes
foo1 abc foo2, foo5
foo2 abc foo1, foo5
foo3 efg
foo4 xyz
foo5 abc foo1, foo2
CodePudding user response:
You could use groupby.agg, then map:
df['Dupes'] = df['Value'].map(df.groupby('Value')['Name'].agg(', '.join))
Output:
Name Value Dupes
0 foo1 abc foo1, foo2, foo5
1 foo2 abc foo1, foo2, foo5
2 foo3 efg foo3
3 foo4 xyz foo4
4 foo5 abc foo1, foo2, foo5
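For reference, here is a complete runnable version of that approach, assuming the sample frame from the question:

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    "Name": ["foo1", "foo2", "foo3", "foo4", "foo5"],
    "Value": ["abc", "abc", "efg", "xyz", "abc"],
})

# Collapse each Value group to one joined string of Names,
# then map that string back onto every row via its Value
dupes = df.groupby("Value")["Name"].agg(", ".join)
df["Dupes"] = df["Value"].map(dupes)
print(df)
```

The group order follows the row order within each group, so the joined names come out as they appear in the frame.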
CodePudding user response:
We can do transform with groupby:
df['new'] = df.groupby('Value')['Name'].transform(','.join)
Out[112]:
0 foo1,foo2,foo5
1 foo1,foo2,foo5
2 foo3
3 foo4
4 foo1,foo2,foo5
Name: Name, dtype: object
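A runnable sketch of the transform approach, again assuming the sample data; the design point is that transform keeps the original index, so no separate map step is needed:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["foo1", "foo2", "foo3", "foo4", "foo5"],
    "Value": ["abc", "abc", "efg", "xyz", "abc"],
})

# transform returns a Series aligned to df's index,
# so it can be assigned directly as a new column
df["new"] = df.groupby("Value")["Name"].transform(",".join)
```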
CodePudding user response:
The second desired output differs from the first in that non-duplicated values have nothing in the Dupes column. Here is a way to handle that, while also ensuring only unique names appear in the Dupes column.
(df.assign(Dupes=df['Value'].map(df.groupby('Value')['Name']
                                   .agg(set)
                                   .where(lambda x: x.str.len().gt(1))
                                   .map(','.join, na_action='ignore'))))
Here is an alternative way:
(df.assign(Dups = df.loc[df.duplicated(subset=['Value'],keep=False)]
.groupby('Value')['Name']
.transform(lambda x: ','.join(set(x)))))
Output:
Name Value Dups
0 foo1 abc foo1,foo2,foo5
1 foo2 abc foo1,foo2,foo5
2 foo3 efg NaN
3 foo4 xyz NaN
4 foo5 abc foo1,foo2,foo5
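If blanks are preferred over NaN (as in the second desired output), the unmatched rows can be filled afterwards. This sketch also sorts each group before joining, since joining a plain set gives an arbitrary order:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["foo1", "foo2", "foo3", "foo4", "foo5"],
    "Value": ["abc", "abc", "efg", "xyz", "abc"],
})

# Only rows whose Value is duplicated get a joined name list;
# the other rows become NaN on reindexing, then an empty string
dups = (df.loc[df.duplicated(subset=["Value"], keep=False)]
          .groupby("Value")["Name"]
          .transform(lambda x: ",".join(sorted(x))))
df["Dups"] = dups.reindex(df.index).fillna("")
```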
Just noticed that in the second desired output, the Dupes column lists only the names from the other rows, excluding the row's own name. Try this:
(df.assign(Dups=df['Value'].map(df.groupby('Value')['Name'].agg(set))
             .sub(df['Name'].map(lambda x: {x}))
             .map(','.join)))
Output:
Name Value Dups
0 foo1 abc foo5,foo2
1 foo2 abc foo5,foo1
2 foo3 efg
3 foo4 xyz
4 foo5 abc foo1,foo2
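To reproduce the second desired output with a stable ordering (the set difference above joins in arbitrary order), one sketch using sorted set differences:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["foo1", "foo2", "foo3", "foo4", "foo5"],
    "Value": ["abc", "abc", "efg", "xyz", "abc"],
})

# All names per Value, minus the row's own name, sorted for a stable order;
# rows with no duplicates end up with an empty string
groups = df.groupby("Value")["Name"].agg(set)
df["Dupes"] = [", ".join(sorted(groups[v] - {n}))
               for n, v in zip(df["Name"], df["Value"])]
```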