Remove duplicate rows from python dataframe with sublists-CodePudding

There is a dataframe that I need to drop the duplicate lines excluding one column. Columns 'ONE' and 'TWO' have strings and 'THREE' has lists. Delete duplicate lines excluding column 'TWO' but keep the first row if duplicate. Ex:

   ONE  TWO  THREE
1   A   A1   ['1','2','3']
2   A   A2   ['3','2','1']
3   B   B1   ['1','2']
4   B   B2   ['1','2','3']
5   C   C1   ['1','2','3']
6   C   C2   ['3','2','1']
7   C   C3   ['1','2']
8   C   C4   ['2','1']

Row 2 is duplicate of 1 because column 'THREE' has list with same elements in both rows, so row 1 is kept and 2 deleted.
Rows 3 and 4 are kept because column 'THREE' has list with different elements
Row 5 is kept and 6 deleted
Row 7 is kept and 8 deleted

Result:

   ONE  TWO  THREE
1   A   A1   ['1','2','3']
3   B   B1   ['1','2']
4   B   B2   ['1','2','3']
5   C   C1   ['1','2','3']
7   C   C3   ['1','2']

I understand that you can not use: df=df.drop_duplicates(subset=df.columns.difference(['TWO '])) because the lists in column 'THREE' are unhashable and we could do that if column 'THREE' was a string but then the order of the elements would matter.

Also df=[list(i) for i in {*[tuple(sorted(i)) for i in df]}] can only be used in a list of lists but then again I could be using it wrong.

What is the right direction to take?

CodePudding user response：

Turn the 'THREE' list values into frozensets using Series.map so the order of the items doesn't matter (assuming they are not necessarily sorted already) and the values are hashable (as drop_duplicates requires). A frozenset is just like a normal set but immutable and hashable.

# if the order of the items in each list matters to consider them as duplicates  
# use df['THREE'].map(tuple) instead
df['THREE'] = df['THREE'].map(frozenset)
df = df.drop_duplicates(subset=['ONE', 'THREE']))

>>> df

  ONE TWO      THREE
1   A  A1  (2, 3, 1)
3   B  B1     (2, 1)
4   B  B2  (2, 3, 1)
5   C  C1  (2, 3, 1)
7   C  C3     (2, 1)

If you want, you can convert the 'THREE' values back to lists using

df['THREE'] = df['THREE'].map(list)

To avoid remapping the 'THREE' values to lists you can instead create a temporary column (temp), and drop it at the end

df = (
    df.assign(temp = df['THREE'].map(frozenset))
      .drop_duplicates(['ONE', 'temp']) 
      .drop(columns='temp')
)

>>> df

  ONE TWO          THREE
1   A  A1  ['1','2','3']
3   B  B1      ['1','2']
4   B  B2  ['1','2','3']
5   C  C1  ['1','2','3']
7   C  C3      ['1','2']