I am trying to find duplicate rows based on a column that contains lists, but I don't get the result I expect. The sample DataFrame I used is:
import pandas as pd

df = pd.DataFrame(
    {
        "author": ["Jefe9", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "ingredients": [
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredD"],
            ["ingredA", "ingredB", "ingredD", "ingredE"],
            ["ingredB", "ingredC", "ingredF"],
        ],
    }
)
The sample DataFrame looks like this:
author date ingredients
0 Jefe9 1423112400 [ingredA, ingredB, ingredC]
1 Jefe98 1423112400 [ingredA, ingredB, ingredC]
2 Alex 1603112400 [ingredA, ingredB, ingredD]
3 Alex 1423115600 [ingredA, ingredB, ingredD, ingredE]
4 Qbert 1663526834 [ingredB, ingredC, ingredF]
The expected output is:
author date ingredients
0 Jefe9 1423112400 [ingredA, ingredB, ingredC]
1 Jefe98 1423112400 [ingredA, ingredB, ingredC]
The code I tried is:
df[df.duplicated(['ingredients'])]
It raised an error, apparently because duplicated() expects a single scalar (hashable) value in each cell rather than a list. Thanks in advance.
CodePudding user response:
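You can convert each value in the ingredients column to a tuple first. Tuples are hashable, so duplicated() can compare them. Sorting before the conversion makes the comparison ignore element order, and keep=False marks every row of a duplicate group instead of only the later ones: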
out = df[(df.assign(ingredients=df['ingredients'].apply(lambda x: tuple(sorted(x))))
.duplicated(['ingredients'], keep=False))]
print(out)
author date ingredients
0 Jefe9 1423112400 [ingredA, ingredB, ingredC]
1 Jefe98 1423112400 [ingredA, ingredB, ingredC]
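Whether to sort before converting depends on whether element order should matter. The sketch below is only an illustration, not part of the original answer: it uses a modified version of the sample data in which row 1 lists the same ingredients in a different order, and the names exact_key and sorted_key are made up for the example.

```python
import pandas as pd

# Modified sample data (an assumption for illustration): row 1 holds the
# same ingredients as row 0, but in reverse order.
df = pd.DataFrame(
    {
        "author": ["Jefe9", "Jefe98", "Alex", "Alex", "Qbert"],
        "ingredients": [
            ["ingredA", "ingredB", "ingredC"],
            ["ingredC", "ingredB", "ingredA"],
            ["ingredA", "ingredB", "ingredD"],
            ["ingredA", "ingredB", "ingredD", "ingredE"],
            ["ingredB", "ingredC", "ingredF"],
        ],
    }
)

# Order-sensitive key: lists match only if their elements are in the same order.
exact_key = df["ingredients"].apply(tuple)

# Order-insensitive key: sort first, so the same ingredients always match.
sorted_key = df["ingredients"].apply(lambda x: tuple(sorted(x)))

# keep=False flags every member of a duplicate group.
exact_dupes = df[exact_key.duplicated(keep=False)]
sorted_dupes = df[sorted_key.duplicated(keep=False)]

print(exact_dupes)
print(sorted_dupes)
```

With this data, the order-sensitive key finds no duplicates, while the sorted key pairs rows 0 and 1. Building the key as a separate Series also leaves the original list column untouched.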