delete redundant rows in a dataframe with set in columns


I have a dataframe df:

Cluster OsId    BrowserId   PageId  VolumePred  ConversionPred
0   11  11  {789615, 955761, 1149586, 955764, 955767, 1187...   147.0   71.0
1   0   11  12  {1184903, 955761, 1149586, 1158132, 955764, 10...   73.0    38.0
2   0   11  15  {1184903, 1109643, 955761, 955764, 1074581, 95...   72.0    40.0
3   0   11  16  {1123200, 1184903, 1109643, 1018637, 1005581, ...   7815.0  5077.0
4   0   11  17  {1184903, 789615, 1016529, 955761, 955764, 955...   52.0    47.0
... ... ... ... ... ... ...
307 {0, 4, 7, 9, 12, 15, 18, 21}    99  16  1154705 220.0   182.0
308 {18}    99  16  1155314 12.0    6.0
309 {9} 99  16  1158132 4.0 4.0
310 {0, 4, 7, 9, 12, 15, 18, 21}    99  16  1184903 966.0   539.0

This dataframe contains redundant rows that I need to delete, so I tried this:

df.drop_duplicates()

But I got this error: TypeError: unhashable type: 'set'

Any idea how to fix this error? Thanks!
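
For reference, the error comes from the fact that Python sets are unhashable, and drop_duplicates needs hashable values to compare rows. A minimal sketch of the problem on a tiny made-up frame (the column names PageId and VolumePred come from the data above, but the values are only illustrative):

import pandas as pd

# any plain set in a cell makes the row unhashable for duplicate detection
df_small = pd.DataFrame({
    "PageId": [{1, 2}, {1, 2}, {3}],
    "VolumePred": [10.0, 10.0, 5.0],
})

try:
    df_small.drop_duplicates()
except TypeError as e:
    print(e)  # unhashable type: 'set'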

CodePudding user response:

Use frozenset to avoid the unhashable set type: build a hashable copy of the frame, test it with DataFrame.duplicated, and filter the original with boolean indexing, inverting the mask with ~:

# sets can appear in any column, so convert them to hashable frozensets
df1 = df.applymap(lambda x: frozenset(x) if isinstance(x, set) else x)
# keep only the rows whose hashable copy is not a duplicate
df[~df1.duplicated()]

If no rows are removed, it means there are no duplicate rows (all columns are compared together).
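
As a quick end-to-end check, here is a self-contained sketch of the same approach on a small made-up frame (column names and values are only illustrative, not the original data). On pandas 2.1 and newer, DataFrame.map can be used in place of the deprecated applymap.

import pandas as pd

# toy frame with duplicated rows; sets may sit in different columns
df = pd.DataFrame({
    "Cluster": [0, 0, {0, 4, 7}, {0, 4, 7}],
    "OsId": [11, 11, 99, 99],
    "PageId": [{789615, 955761}, {789615, 955761}, 1154705, 1154705],
    "VolumePred": [147.0, 147.0, 220.0, 220.0],
})

# hashable copy is used only for the duplicate test; the original rows are kept
df1 = df.applymap(lambda x: frozenset(x) if isinstance(x, set) else x)
print(df[~df1.duplicated()])  # rows 1 and 3 are dropped as duplicates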
