In a dataframe with two columns, [id] and [string], I need to know which rows are duplicates of which rows based on the value of the [string] column.
My dataframe has thousands of rows but only these two columns.
Sample of the input dataframe:
id,string
0,"A B C D"
1,"D B C D E Z"
2,"A B C D"
3,"Z Z Z Z Z Z Z Z Z Z Z Z"
4,"D B C D E Z"
5,"A B C D"
In this sample, rows 0, 2, 5 are duplicates of each other. Also rows 1 and 4 are duplicates of each other. (id is unique)
I want the following output:
[["0","2","5"],["1","4"]]
CodePudding user response:
I'd use groupby and a listcomp.
>>> df
id string
0 0 A B C D
1 1 D B C D E Z
2 2 A B C D
3 3 Z Z Z Z Z Z Z Z Z Z Z Z
4 4 D B C D E Z
5 5 A B C D
>>>
>>> [l for l in df.groupby('string')['id'].apply(list) if len(l) > 1]
[[0, 2, 5], [1, 4]]
If you really want strings in the result, use
>>> [[str(x) for x in l] for l in df.groupby('string')['id'].apply(list) if len(l) > 1]
[['0', '2', '5'], ['1', '4']]
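A self-contained sketch of this approach, reconstructing the sample frame from the question (pandas assumed to be installed):

```python
import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame({
    "id": [0, 1, 2, 3, 4, 5],
    "string": ["A B C D", "D B C D E Z", "A B C D",
               "Z Z Z Z Z Z Z Z Z Z Z Z", "D B C D E Z", "A B C D"],
})

# Collect ids per distinct string, then keep only groups with duplicates
groups = [l for l in df.groupby("string")["id"].apply(list) if len(l) > 1]
print(groups)  # [[0, 2, 5], [1, 4]]
```

Note that `groupby` sorts the group keys, so the output order follows the sorted `string` values, not the original row order.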
CodePudding user response:
You can aggregate the ids into lists per string, then filter by list length using boolean indexing with Series.str.len:
s = df.assign(id = df['id'].astype(str)).groupby('string')['id'].apply(list)
out = s[s.str.len().gt(1)].tolist()
If the id values are already strings:
s = df.groupby('string')['id'].apply(list)
out = s[s.str.len().gt(1)].tolist()
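A small sketch of why this works: the `.str` accessor's `len` also applies to list-valued elements, so it can measure the aggregated groups directly (list contents here are illustrative):

```python
import pandas as pd

# A Series of aggregated id lists, as produced by groupby(...).apply(list)
s = pd.Series([[0, 2, 5], [1, 4], [3]])

# .str.len() returns the length of each list element
print(s.str.len().tolist())  # [3, 2, 1]

# Boolean indexing keeps only groups with more than one member
out = s[s.str.len().gt(1)].tolist()
print(out)  # [[0, 2, 5], [1, 4]]
```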
CodePudding user response:
Another option is using duplicated:
>>> df[df.duplicated('string',False)].groupby('string')['id'].apply(lambda x: x.astype(str).tolist()).tolist()
# [['0', '2', '5'], ['1', '4']]
If there's no need to change the id type:
>>> df[df.duplicated('string',False)].groupby('string')['id'].apply(list).tolist()
# [[0, 2, 5], [1, 4]]
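A self-contained sketch of the `duplicated` approach, again rebuilding the sample frame from the question. The key detail is `keep=False`, which marks every occurrence of a duplicated value rather than only the later ones:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [0, 1, 2, 3, 4, 5],
    "string": ["A B C D", "D B C D E Z", "A B C D",
               "Z Z Z Z Z Z Z Z Z Z Z Z", "D B C D E Z", "A B C D"],
})

# keep=False flags all rows whose string appears more than once
mask = df.duplicated("string", keep=False)
print(mask.tolist())  # [True, True, True, False, True, True]

# Drop the unique rows first, then group the survivors
out = df[mask].groupby("string")["id"].apply(list).tolist()
print(out)  # [[0, 2, 5], [1, 4]]
```

Filtering before grouping skips the `len(l) > 1` check afterwards, since singletons never reach the `groupby`.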