I need to delete duplicated rows based on combination of two columns (person1 and person2 columns) which have strings. For example person1: ryan and person2: delta or person 1: delta and person2: ryan is same and provides the same value in messages column. Need to drop one of these two rows. Return the non duplicated rows as well.
Code to recreate df
df = pd.DataFrame({"": [0,1,2,3,4,5,6],
"person1": ["ryan", "delta", "delta", "delta","bravo","alpha","ryan"],
"person2": ["delta", "ryan", "alpha", "bravo","delta","ryan","alpha"],
"messages": [1, 1, 2, 3,3,9,9]})
df
person1 person2 messages
0 0 ryan delta 1
1 1 delta ryan 1
2 2 delta alpha 2
3 3 delta bravo 3
4 4 bravo delta 3
5 5 alpha ryan 9
6 6 ryan alpha 9
Answer df should be:
finaldf
person1 person2 messages
0 0 ryan delta 1
1 2 delta alpha 2
2 3 delta bravo 3
3 5 alpha ryan 9
CodePudding user response:
Try as follows:
res = (df[~df.filter(like='person').apply(frozenset, axis=1).duplicated()]
.reset_index(drop=True))
print(res)
person1 person2 messages
0 0 ryan delta 1
1 2 delta alpha 2
2 3 delta bravo 3
3 5 alpha ryan 9
Explanation
- First, we use
df.filter
to select just the columns withperson*
. - For these columns only we use
df.apply
to turn each row (axis=1
) into afrozenset
. So, at this stage, we are looking at apd.Series
like this:
0 (ryan, delta)
1 (ryan, delta)
2 (alpha, delta)
3 (bravo, delta)
4 (bravo, delta)
5 (alpha, ryan)
6 (alpha, ryan)
dtype: object
- Now, we want to select the duplicate rows, using
Series.duplicated
and add~
as a prefix to the resulting boolean series to select the inverse from the originaldf
. - Finally, we reset the index with
df.reset_index
.
CodePudding user response:
Here's a less general approach than the one given by @ouroboros1, this only works for your two columns case
#make a Series of strings of min of p1/p2 concat to max of p1/p2
sorted_p1p2 = df[['person1','person2']].min(axis=1) '_' df[['person1','person2']].max(axis=1)
#subset to non-dup from the Series
dedup_df = df[~sorted_p1p2.duplicated()]
CodePudding user response:
You can put the two person columns in order within each row, then drop duplicates.
import pandas as pd
df = pd.DataFrame({"": [0,1,2,3,4,5,6],
"person1": ["ryan", "delta", "delta", "delta","bravo","alpha","ryan"],
"person2": ["delta", "ryan", "alpha", "bravo","delta","ryan","alpha"],
"messages": [1, 1, 2, 3,3,9,9]})
print(df)
swap = df['person1'] < df['person2']
df.loc[swap, ['person1', 'person2']] = df.loc[swap, ['person2', 'person1']].values
df = df.drop_duplicates(subset=['person1', 'person2'])
print(df)
After the swap:
person1 person2 messages
0 0 ryan delta 1
1 1 ryan delta 1
2 2 delta alpha 2
3 3 delta bravo 3
4 4 delta bravo 3
5 5 ryan alpha 9
6 6 ryan alpha 9
After dropping duplicates:
person1 person2 messages
0 0 ryan delta 1
2 2 delta alpha 2
3 3 delta bravo 3
5 5 ryan alpha 9