I have a data frame with many columns, but one column contains lists that look like this:
df =
val1
---------------
0 ["hej", "hello"]
1 ["mus", "mouse"]
2 ["hest", "horse"]
3 ["hello", "hej"]
4 ["mouse", "mus"]
If I just do:
df.drop_duplicates(subset=["val1"], keep="first")
nothing gets dropped, because the lists are compared as-is, so ["hej", "hello"] and ["hello", "hej"] count as different values. In my case, though, ["hej", "hello"] is the same as ["hello", "hej"], and ["mus", "mouse"] is the same as ["mouse", "mus"]. So here it should keep only ["hej", "hello"] and ["mus", "mouse"], ending up with a data frame with the rows:
df_final =
val1
---------------
0 ["hej", "hello"]
1 ["mus", "mouse"]
2 ["hest", "horse"]
Is there any way to accomplish this?
CodePudding user response:
Convert each list to a frozenset and use duplicated for boolean indexing:
df_final = df[~df['val1'].apply(frozenset).duplicated()]
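As a quick check, here is a runnable sketch of the frozenset approach on the question's sample data (the DataFrame is reconstructed here to make the snippet self-contained):

```python
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({"val1": [["hej", "hello"], ["mus", "mouse"],
                            ["hest", "horse"], ["hello", "hej"],
                            ["mouse", "mus"]]})

# frozenset is hashable and ignores element order,
# so reordered lists compare as duplicates
mask = df["val1"].apply(frozenset).duplicated()
df_final = df[~mask]
print(df_final)
```

This keeps only rows 0, 1, and 2, matching the desired df_final above.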
Or, if a list can contain repeated items and the counts matter, build a Counter. Note that a Counter is a dict subclass and therefore unhashable, so wrap its items in a frozenset before calling duplicated:
from collections import Counter
df_final = df[~df['val1'].apply(lambda x: frozenset(Counter(x).items())).duplicated()]
Output:
val1
0 [hej, hello]
1 [mus, mouse]
2 [hest, horse]
Difference between the two approaches:
# input
val1
0 [hej, hello]
1 [mus, mouse]
2 [hest, horse]
3 [hello, hej]
4 [mouse, mus]
5 [mouse, mus, mus] # two rows with duplicated items
6 [mus, mouse, mus] #
# output with frozenset
val1
0 [hej, hello]
1 [mus, mouse]
2 [hest, horse]
# output with Counter
val1
0 [hej, hello]
1 [mus, mouse]
2 [hest, horse]
5 [mouse, mus, mus] # this one has 2 "mus"
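The comparison above can be reproduced end to end with the following sketch, using the extended input with repeated items (the DataFrame is reconstructed here so the snippet runs standalone):

```python
import pandas as pd
from collections import Counter

df = pd.DataFrame({"val1": [["hej", "hello"], ["mus", "mouse"],
                            ["hest", "horse"], ["hello", "hej"],
                            ["mouse", "mus"], ["mouse", "mus", "mus"],
                            ["mus", "mouse", "mus"]]})

# frozenset ignores both order and multiplicity:
# ["mouse", "mus", "mus"] collapses to {"mouse", "mus"} and is dropped too
by_set = df[~df["val1"].apply(frozenset).duplicated()]

# Counter keeps multiplicity; frozenset(items) makes it hashable,
# so ["mouse", "mus", "mus"] (two "mus") survives as a distinct row
by_count = df[~df["val1"].apply(lambda x: frozenset(Counter(x).items())).duplicated()]

print(by_set)
print(by_count)
```

The frozenset version keeps rows 0, 1, 2; the Counter version additionally keeps row 5, because its item counts differ from row 1's.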