There is a dataframe that I need to drop the duplicate lines excluding one column. Columns 'ONE' and 'TWO' have strings and 'THREE' has lists. Delete duplicate lines excluding column 'TWO' but keep the first row if duplicate. Ex:
ONE TWO THREE
1 A A1 ['1','2','3']
2 A A2 ['3','2','1']
3 B B1 ['1','2']
4 B B2 ['1','2','3']
5 C C1 ['1','2','3']
6 C C2 ['3','2','1']
7 C C3 ['1','2']
8 C C4 ['2','1']
- Row 2 is duplicate of 1 because column 'THREE' has list with same elements in both rows, so row 1 is kept and 2 deleted.
- Rows 3 and 4 are kept because column 'THREE' has list with different elements
- Row 5 is kept and 6 deleted
- Row 7 is kept and 8 deleted
Result:
ONE TWO THREE
1 A A1 ['1','2','3']
3 B B1 ['1','2']
4 B B2 ['1','2','3']
5 C C1 ['1','2','3']
7 C C3 ['1','2']
I understand that you can not use: df=df.drop_duplicates(subset=df.columns.difference(['TWO ']))
because the lists in column 'THREE' are unhashable and we could do that if column 'THREE' was a string but then the order of the elements would matter.
Also df=[list(i) for i in {*[tuple(sorted(i)) for i in df]}]
can only be used in a list of lists but then again I could be using it wrong.
What is the right direction to take?
CodePudding user response:
Turn the 'THREE' list values into frozensets
using Series.map
so the order of the items doesn't matter (assuming they are not necessarily sorted already) and the values are hashable (as drop_duplicates
requires). A frozenset
is just like a normal set
but immutable and hashable.
# if the order of the items in each list matters to consider them as duplicates
# use df['THREE'].map(tuple) instead
df['THREE'] = df['THREE'].map(frozenset)
df = df.drop_duplicates(subset=['ONE', 'THREE']))
>>> df
ONE TWO THREE
1 A A1 (2, 3, 1)
3 B B1 (2, 1)
4 B B2 (2, 3, 1)
5 C C1 (2, 3, 1)
7 C C3 (2, 1)
If you want, you can convert the 'THREE' values back to lists using
df['THREE'] = df['THREE'].map(list)
To avoid remapping the 'THREE' values to lists you can instead create a temporary column (temp
), and drop it at the end
df = (
df.assign(temp = df['THREE'].map(frozenset))
.drop_duplicates(['ONE', 'temp'])
.drop(columns='temp')
)
>>> df
ONE TWO THREE
1 A A1 ['1','2','3']
3 B B1 ['1','2']
4 B B2 ['1','2','3']
5 C C1 ['1','2','3']
7 C C3 ['1','2']