Home > front end >  Detecting duplicates in pandas when a column contains lists
Detecting duplicates in pandas when a column contains lists

Time:05-26

Is there a reasonable way to detect duplicates in a Pandas dataframe when a column contains lists or numpy nd arrays, like the example below? I know I could convert the lists into strings, but the act of converting back and forth feels... wrong. Plus, lists seem more legible and convenient given ~how I got here (online code) and where I'm going after.

import pandas as pd

df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "ingredients": [
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredD"],
            ["ingredA", "ingredB", "ingredD", "ingredE"],
            ["ingredB", "ingredC", "ingredF"],
        ],
    }
)

# Traditional find duplicates
# df[df.duplicated(keep=False)]

# Avoiding pandas duplicated function (question 70596016 solution)
i = [hash(tuple(i.values())) for i in df.to_dict(orient="records")]
j = [i.count(k) > 1 for k in i]
df[j]

Both methods (the latter from this alternative find duplicates answer) result in

TypeError: unhashable type: 'list'.

They would work, of course, if the dataframe looked like this:

df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "recipe": [
            "recipeC",
            "recipeC",
            "recipeD",
            "recipeE",
            "recipeF",
        ],
    }
)

Which made me wonder if something like integer encoding might be reasonable? It's not that different from converting to/from strings, but at least it's legible. Alternatively, suggestions for converting to a single string of ingredients per row directly from the starting dataframe in the code link above would be appreciated (i.e., avoiding lists altogether).

CodePudding user response:

With map tuple

out = df[df.assign(rating = df['rating'].map(tuple)).duplicated(keep=False)]
Out[295]: 
   author        date                       rating
0  Jefe98  1423112400  [ingredA, ingredB, ingredC]
1  Jefe98  1423112400  [ingredA, ingredB, ingredC]
  • Related