Get rows with duplicate values in a column of lists

Time:08-24

I am trying to find duplicate rows with respect to a column that contains lists, but I don't get the result I expect. The sample dataframe I used is:

import pandas as pd

df = pd.DataFrame(
{
    "author": ["Jefe9", "Jefe98", "Alex", "Alex", "Qbert"],
    "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
    "ingredients": [
        ["ingredA", "ingredB", "ingredC"],
        ["ingredA", "ingredB", "ingredC"],
        ["ingredA", "ingredB", "ingredD"],
        ["ingredA", "ingredB", "ingredD", "ingredE"],
        ["ingredB", "ingredC", "ingredF"],
    ],
}
)

The sample dataframe looks like this:

   author        date                           ingredients
0   Jefe9  1423112400           [ingredA, ingredB, ingredC]
1  Jefe98  1423112400           [ingredA, ingredB, ingredC]
2    Alex  1603112400           [ingredA, ingredB, ingredD]
3    Alex  1423115600  [ingredA, ingredB, ingredD, ingredE]
4   Qbert  1663526834           [ingredB, ingredC, ingredF]

The expected output is:

   author        date                  ingredients
0   Jefe9  1423112400  [ingredA, ingredB, ingredC]
1  Jefe98  1423112400  [ingredA, ingredB, ingredC]

The code I tried is:

df[df.duplicated(['ingredients'])]

It raised an error because duplicated expects a single scalar value in each cell rather than a list. Thanks in advance.

CodePudding user response:

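You can convert each value in the ingredients column to a tuple first (sorted, so that element order does not matter), and pass keep=False so that every row of a duplicate group is returned rather than only the repeats: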

out = df[(df.assign(ingredients=df['ingredients'].apply(lambda x: tuple(sorted(x))))
          .duplicated(['ingredients'], keep=False))]
print(out)

   author        date                  ingredients
0   Jefe9  1423112400  [ingredA, ingredB, ingredC]
1  Jefe98  1423112400  [ingredA, ingredB, ingredC]
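
The conversion works because tuples are hashable and lists are not, and duplicated needs hashable values to compare. As a rough sketch, the same idea can be wrapped in a small helper. The function name rows_with_duplicate_lists and its ignore_order parameter are made up for illustration and are not part of pandas:

```python
import pandas as pd


def rows_with_duplicate_lists(df, column, ignore_order=True):
    """Return the rows whose list in `column` also appears in another row.

    Hypothetical helper, not a pandas API.
    """
    def to_key(items):
        # Tuples are hashable, lists are not. Sorting first makes
        # ["a", "b"] and ["b", "a"] count as the same value.
        return tuple(sorted(items)) if ignore_order else tuple(items)

    keys = df[column].apply(to_key)
    # keep=False flags every member of a duplicate group,
    # not only the second and later occurrences.
    return df[keys.duplicated(keep=False)]


df = pd.DataFrame(
    {
        "author": ["Jefe9", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "ingredients": [
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredD"],
            ["ingredA", "ingredB", "ingredD", "ingredE"],
            ["ingredB", "ingredC", "ingredF"],
        ],
    }
)

out = rows_with_duplicate_lists(df, "ingredients")
print(out)
```

With ignore_order=False the lists must match element by element in the same order. The original ingredients column is left untouched in both cases, since the tuples are only used as comparison keys.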