I have a pandas dataframe containing account IDs, home, work, and mobile phone numbers. All of these values are strings. My goal is to update the values of each row such that all duplicate numbers both within the same row and across different rows are set to NaN, leaving one 'original' number. How can I accomplish this in an efficient way?
When updating values in the same row, priority is given to home phone first and then work phone second. So if home == work == mobile
, both work and mobile are updated to NaN. If home != work == mobile
, then mobile is updated to NaN. When updating values in different rows, it does not matter which duplicate phone number is kept as the 'original' number. For example, if A['home'] == B['mobile'] == C['work']
, two of those values should be set to NaN and the remaining one remain unchanged. I have chosen to keep the first number and set the other duplicate numbers to NaN when displaying the dataframe.
I've figured out how to update values within the same row using df.loc
, but I've been unsuccessful in figuring out how I can also accomplish updating duplicate values to NaN across different rows and columns. How can I achieve this?
Below is further information on what I'm trying to do and where I'm getting stuck:
My initial dataframe looks something like this:
acct_id home work mobile
A 1111111111 1111111111 1111111111
B 2222222222 2222222222 2222222222
C 3333333333 3333333333 3333333333
D 4444444444 5555555555 5555555555
E 6666666666 7777777777 8888888888
F 9999999999 9999999999 8888888888
G 7777777777 6666666666 5555555555
H 4444444444 3333333333 2222222222
I NaN NaN NaN
and my goal is to update the dataframe so that it looks like this:
acct_id home work mobile
A 1111111111 NaN NaN
B 2222222222 NaN NaN
C 3333333333 NaN NaN
D 4444444444 5555555555 NaN
E 6666666666 7777777777 8888888888
F 9999999999 NaN NaN
G NaN NaN NaN
H NaN NaN NaN
I NaN NaN NaN
I'm currently approaching this as a 2 step problem. Step 1 is removing duplicate numbers in the same row. Step 2 is removing duplicate numbers across different rows and different columns.
I have figured out step 1, using the df.loc
command:
df.loc[df['home'] == df['work'], ['work']] = np.nan
df.loc[df['home'] == df['mobile'], ['mobile']] = np.nan
df.loc[df['work'] == df['mobile'], ['mobile']] = np.nan
This is what my dataframe looks like after running the above commands:
acct_no home work mobile
A 1111111111 NaN NaN
B 2222222222 NaN NaN
C 3333333333 NaN NaN
D 4444444444 5555555555 NaN
E 6666666666 7777777777 8888888888
F 9999999999 NaN 8888888888
G 7777777777 6666666666 5555555555
H 4444444444 3333333333 2222222222
I NaN NaN NaN
However, I can't wrap my head around step 2. As a brute force method, I have found that I can sort the dataframe on home and then loop through each row, checking if the previous row's home value is the same as the current row's, setting the current row's value to nan if it is the same. Lastly, I would have to repeat that process for the work and mobile keys. This is what the code for checking the home field would look like:
df.sort_values(by='home', inplace=True)
prev_row = {'home':None,'work':None,'mobile':None}
for cur_idx,cur_row in df.iterrows():
if prev_row['home'] == cur_row['home']:
cur_row['home'] = np.nan
prev_row = cur_row
After running the above code just for updating and checking on the home column, my dataframe will look like this:
acct_no home work mobile
A 1111111111 NaN NaN
B 2222222222 NaN NaN
C 3333333333 NaN NaN
D 4444444444 5555555555 NaN
E NaN 3333333333 2222222222
F 6666666666 7777777777 8888888888
G 7777777777 6666666666 5555555555
H 9999999999 NaN 8888888888
I NaN NaN NaN
This solution is pretty hacky and not efficient for larger datasets, so how can I achieve this in a more efficient manner?
Any help is greatly appreciated -- thank you in advance!
CodePudding user response:
This might address your step 2 needs. If not, feel free to go with another approach.
df = pd.DataFrame(
[
dict(acct_no="D", home="4444444444", work="5555555555"),
dict(acct_no="E", home=np.NaN, work="3333333333", mobile="2222222222"),
dict(acct_no="J", home=np.NaN, work=np.NaN, mobile="8888888888"),
dict(acct_no="K", home=np.NaN, work="8888888888"),
dict(acct_no="L", home=np.NaN, work=np.NaN, mobile="8888888888"),
]
)
df["phone"] = (df.home
.combine_first(df.work)
.combine_first(df.mobile))
df = (df.sort_values(by="phone")
.drop_duplicates(subset="phone")
.set_index("acct_no"))
print(df)
output
home work mobile phone
acct_no
E NaN 3333333333 2222222222 3333333333
D 4444444444 5555555555 NaN 4444444444
J NaN NaN 8888888888 8888888888
In this implementation we are only looking at the phone
column, which is the preferred number for an account.
That might be a bit more draconian that desired.
Notice for example that accounts "K" and "L" were nuked entirely,
on the basis of sharing a phone number with "J".
If multiple customers share a Home land line,
that might not be the desired business logic.
Notice also that if "K" were to add a Home
number of 7878787878 he would survive,
despite the 8888888888 dup.
If Mobile is "more unique" than Home,
perhaps we should prefer that number.
Now that we have used the phone
column
to good advantage, feel free to .drop() it.
The sort costs O(N log N), and everything else is linear, so this should be a performant solution, even for large datasets.
CodePudding user response:
In my opinion, the simplest would be to stack
the numbers as Series, mask
or drop_duplicates
and then restore the original shape:
out = (df.set_index('acct_id').stack()
# the magic happens here
.mask(lambda d: d.duplicated())
# restore original format
.unstack().reset_index()
)
Alternative:
out = (df.set_index('acct_id')
.stack().drop_duplicates().unstack()
.reindex(df['acct_id']).reset_index().reindex(columns=df.columns)
)
Output:
acct_id home work mobile
0 A 1111111111 NaN NaN
1 B 2222222222 NaN NaN
2 C 3333333333 NaN NaN
3 D 4444444444 5555555555 NaN
4 E 6666666666 7777777777 8888888888
5 F 9999999999 NaN NaN
6 G NaN NaN NaN
7 H NaN NaN NaN
8 I NaN NaN NaN
order: columns first
If you want, you can easily tweak the above to give a column-first preference in the way to chose the duplicates to keep:
out = (df.set_index('acct_id').unstack()
.mask(lambda d: d.duplicated())
.swaplevel().unstack().reset_index() # or: .unstack().T.reset_index()
)
Output:
acct_id home work mobile
0 A 1111111111 NaN NaN
1 B 2222222222 NaN NaN
2 C 3333333333 NaN NaN
3 D 4444444444 5555555555 NaN
4 E 6666666666 NaN 8888888888
5 F 9999999999 NaN NaN
6 G 7777777777 NaN NaN
7 H NaN NaN NaN
8 I NaN NaN NaN