Remove duplicated rows based on a Series of lists in which item order doesn't matter


I have a data frame with a lot of columns, but one of them holds lists that look like this:

df =

   val1
   ---------------
0  ["hej", "hello"]
1  ["mus", "mouse"]
2  ["hest", "horse"]
3  ["hello", "hej"]
4  ["mouse", "mus"]

If I just do:

df.drop_duplicates(subset=["val1"], keep="first")

This doesn't work: lists aren't hashable, so pandas raises TypeError: unhashable type: 'list'. And even if it ran, it would treat ["hej", "hello"] and ["hello", "hej"] as distinct, whereas for my purposes they are the same, just as ["mus", "mouse"] and ["mouse", "mus"] are. So the result should keep only ["hej", "hello"], ["mus", "mouse"], and ["hest", "horse"], ending up with a data frame with the rows:

df_final =

   val1
   ---------------
0  ["hej", "hello"]
1  ["mus", "mouse"]
2  ["hest", "horse"]

Is there any way to accomplish this?

CodePudding user response:

Convert each list to a frozenset and use duplicated for boolean indexing:

df_final = df[~df['val1'].apply(frozenset).duplicated()]
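
A frozenset works as a deduplication key because it is hashable (so duplicated can compare rows) and ignores element order. A minimal, self-contained sketch, rebuilding the example frame from the question:

import pandas as pd

df = pd.DataFrame({
    "val1": [
        ["hej", "hello"],
        ["mus", "mouse"],
        ["hest", "horse"],
        ["hello", "hej"],
        ["mouse", "mus"],
    ]
})

# lists with the same items in any order map to the same frozenset,
# so duplicated() marks every occurrence after the first
df_final = df[~df["val1"].apply(frozenset).duplicated()]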

Or, if items can repeat within a list and the counts matter, build the key from a Counter. A Counter is a dict subclass and therefore not hashable, so duplicated cannot hash it directly; a frozenset of its (item, count) pairs is:

from collections import Counter

# order-insensitive key that also preserves per-item counts
df_final = df[~df['val1'].apply(lambda x: frozenset(Counter(x).items())).duplicated()]
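
An alternative multiset key, assuming the list elements are mutually comparable (here they are all strings), is a sorted tuple, which is likewise hashable, ignores the original order, and keeps duplicates:

df_final = df[~df['val1'].apply(lambda x: tuple(sorted(x))).duplicated()]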

Output:

            val1
0   [hej, hello]
1   [mus, mouse]
2  [hest, horse]

Difference between the two approaches:

# input
                val1
0       [hej, hello]
1       [mus, mouse]
2      [hest, horse]
3       [hello, hej]
4       [mouse, mus]
5  [mouse, mus, mus]   # two rows with duplicated items
6  [mus, mouse, mus]   #

# output with frozenset
            val1
0   [hej, hello]
1   [mus, mouse]
2  [hest, horse]

# output with Counter
                val1
0       [hej, hello]
1       [mus, mouse]
2      [hest, horse]
5  [mouse, mus, mus]  # this one has 2 "mus"
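
The gap comes from what each key preserves: frozenset discards multiplicity, so rows 5 and 6 collapse to the same set as row 1 and are dropped, while the Counter-based key keeps per-item counts, so the three-item lists match only each other and row 5 survives as their first occurrence (row 6 is dropped as its duplicate). A quick check:

frozenset(["mouse", "mus", "mus"]) == frozenset(["mus", "mouse"])  # True: counts are lost
Counter(["mouse", "mus", "mus"]) == Counter(["mus", "mouse"])      # False: two "mus" vs one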