Home > other >  Remove duplicates based on combination of two columns in Pandas
Remove duplicates based on combination of two columns in Pandas

Time:12-07

I need to delete duplicated rows based on combination of two columns (person1 and person2 columns) which have strings. For example person1: ryan and person2: delta or person 1: delta and person2: ryan is same and provides the same value in messages column. Need to drop one of these two rows. Return the non duplicated rows as well.

Code to recreate df 
df = pd.DataFrame({"": [0,1,2,3,4,5,6],
                     "person1": ["ryan", "delta", "delta", "delta","bravo","alpha","ryan"], 
                     "person2": ["delta", "ryan", "alpha", "bravo","delta","ryan","alpha"], 
                     "messages": [1, 1, 2, 3,3,9,9]})
 df
        person1 person2 messages
0   0   ryan    delta   1
1   1   delta   ryan    1
2   2   delta   alpha   2
3   3   delta   bravo   3
4   4   bravo   delta   3
5   5   alpha   ryan    9
6   6   ryan    alpha   9

Answer df should be:

 finaldf
        person1 person2 messages
0   0   ryan    delta   1
1   2   delta   alpha   2
2   3   delta   bravo   3
3   5   alpha   ryan    9

CodePudding user response:

Try as follows:

res = (df[~df.filter(like='person').apply(frozenset, axis=1).duplicated()]
       .reset_index(drop=True))

print(res)

     person1 person2  messages
0  0    ryan   delta         1
1  2   delta   alpha         2
2  3   delta   bravo         3
3  5   alpha    ryan         9

Explanation

  • First, we use df.filter to select just the columns with person*.
  • For these columns only we use df.apply to turn each row (axis=1) into a frozenset. So, at this stage, we are looking at a pd.Series like this:
0     (ryan, delta)
1     (ryan, delta)
2    (alpha, delta)
3    (bravo, delta)
4    (bravo, delta)
5     (alpha, ryan)
6     (alpha, ryan)
dtype: object
  • Now, we want to select the duplicate rows, using Series.duplicated and add ~ as a prefix to the resulting boolean series to select the inverse from the original df.
  • Finally, we reset the index with df.reset_index.

CodePudding user response:

Here's a less general approach than the one given by @ouroboros1, this only works for your two columns case

#make a Series of strings of min of p1/p2 concat to max of p1/p2  
sorted_p1p2 = df[['person1','person2']].min(axis=1) '_' df[['person1','person2']].max(axis=1)

#subset to non-dup from the Series
dedup_df = df[~sorted_p1p2.duplicated()]

CodePudding user response:

You can put the two person columns in order within each row, then drop duplicates.

import pandas as pd

df = pd.DataFrame({"": [0,1,2,3,4,5,6],
                     "person1": ["ryan", "delta", "delta", "delta","bravo","alpha","ryan"],
                     "person2": ["delta", "ryan", "alpha", "bravo","delta","ryan","alpha"],
                     "messages": [1, 1, 2, 3,3,9,9]})

print(df)
swap = df['person1'] < df['person2']
df.loc[swap, ['person1', 'person2']] = df.loc[swap, ['person2', 'person1']].values

df = df.drop_duplicates(subset=['person1', 'person2'])

print(df)

After the swap:

     person1 person2  messages
0  0    ryan   delta         1
1  1    ryan   delta         1
2  2   delta   alpha         2
3  3   delta   bravo         3
4  4   delta   bravo         3
5  5    ryan   alpha         9
6  6    ryan   alpha         9

After dropping duplicates:

     person1 person2  messages
0  0    ryan   delta         1
2  2   delta   alpha         2
3  3   delta   bravo         3
5  5    ryan   alpha         9
  • Related