I have come across a perhaps very specific problem, I have the following data:
Input
id | name | phone1 | phone2 | phone3 |
---|---|---|---|---|
1 | John | 3222430085 | 9983146924 | 1234567890 |
2 | Vanessa | 3222430085 | 9934254243 | |
3 | Maria | 7861628239 | 7861628239 | 7861628239 |
4 | Oscar | 7861157888 | 3222430085 | |
4 | Paul | 7151087912 | 5555555555 | |
5 | Paul | 7151087912 | 5555555555 |
Expected output
id | name | phone1 | phone2 | phone3 |
---|---|---|---|---|
1 | John | 3222430085 | 9983146924 | 1234567890 |
2 | Vanessa | 9934254243 | ||
3 | Maria | 7861628239 | ||
4 | Oscar | 7861157888 | ||
4 | Paul | 7151087912 | 5555555555 | |
5 | Paul |
I pose the problem basically based on a couple of selected columns to eliminate the duplicate records that are found within them.
In this case selected phone1
, phone2
and phone3
I have managed with pandas to remove entire rows that are exactly the same or by selecting columns but with this case I have not been able to solve it, if anyone has any idea if it is possible to do this I will be very grateful
CodePudding user response:
Filter
the phone
like columns then stack
to reshape into a multiindex series, then drop duplicates in the series then unstack
and assign
the result back to original dataframe
df.assign(**df.filter(like='phone').stack().drop_duplicates().unstack())
id name phone1 phone2 phone3
0 1 John 3222430085 9983146924 1234567890
1 2 Vanessa NaN 9934254243 NaN
2 3 Maria 7861628239 NaN NaN
3 4 Oscar 7861157888 NaN NaN
4 4 Paul 7151087912 5555555555 NaN
5 5 Paul NaN NaN NaN