Delete redundant rows in a dataframe with sets in columns

I have a dataframe df:

Cluster OsId    BrowserId   PageId  VolumePred  ConversionPred
0   11  11  {789615, 955761, 1149586, 955764, 955767, 1187...   147.0   71.0
1   0   11  12  {1184903, 955761, 1149586, 1158132, 955764, 10...   73.0    38.0
2   0   11  15  {1184903, 1109643, 955761, 955764, 1074581, 95...   72.0    40.0
3   0   11  16  {1123200, 1184903, 1109643, 1018637, 1005581, ...   7815.0  5077.0
4   0   11  17  {1184903, 789615, 1016529, 955761, 955764, 955...   52.0    47.0
... ... ... ... ... ... ...
307 {0, 4, 7, 9, 12, 15, 18, 21}    99  16  1154705 220.0   182.0
308 {18}    99  16  1155314 12.0    6.0
309 {9} 99  16  1158132 4.0 4.0
310 {0, 4, 7, 9, 12, 15, 18, 21}    99  16  1184903 966.0   539.0

This dataframe contains redundant rows that I need to delete, so I tried this:

df.drop_duplicates()

But I got this error: TypeError: unhashable type: 'set'

Any idea how to fix this error? Thanks!

CodePudding user response：

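Sets are unhashable, which is why drop_duplicates fails. Convert them to frozensets (which are hashable), call DataFrame.duplicated on the converted copy, and filter the original dataframe by boolean indexing with the mask inverted by ~: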

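# Sets may appear in any column, so convert every set to a hashable frozenset
# (applymap is deprecated since pandas 2.1 in favour of DataFrame.map)
df1 = df.applymap(lambda x: frozenset(x) if isinstance(x, set) else x)
# Keep only the rows of the original df that are not flagged as duplicates
df[~df1.duplicated()]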

If no rows are removed, the dataframe has no duplicated rows. duplicated compares all columns together, so a row counts as a duplicate only if every column matches an earlier row.
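
As a minimal sketch of the approach above, here is a self-contained run on a small made-up dataframe. The column values and the freeze helper are hypothetical and not taken from the question. It uses Series.map column by column, which should behave the same as applymap here while avoiding the deprecated name:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical
# ({0, 4} and {4, 0} are the same set), rows 2 and 3 differ in PageId.
df = pd.DataFrame({
    "Cluster": [{0, 4}, {4, 0}, {18}, {18}],
    "OsId": [99, 99, 99, 99],
    "PageId": [1, 1, 2, 3],
})

def freeze(value):
    """Return a hashable frozenset for sets; leave other values unchanged."""
    return frozenset(value) if isinstance(value, set) else value

# Calling drop_duplicates directly is expected to fail because sets are unhashable.
try:
    df.drop_duplicates()
except TypeError as exc:
    print("drop_duplicates failed:", exc)

# Build a hashable copy, one column at a time.
hashable = df.apply(lambda col: col.map(freeze))

# True for every row that repeats an earlier row across all columns.
mask = hashable.duplicated()

# Filter the original dataframe, so the result still holds ordinary sets.
deduped = df[~mask]
print(deduped)  # rows 0, 2 and 3 remain
```

Filtering the original df rather than the converted copy keeps the Cluster values as plain sets. If frozensets are acceptable in the result, hashable.drop_duplicates() would be a shorter alternative.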