Update pandas column values if they are equal to another column's value-CodePudding

I have a pandas dataframe containing account IDs, home, work, and mobile phone numbers. All of these values are strings. My goal is to update the values of each row such that all duplicate numbers both within the same row and across different rows are set to NaN, leaving one 'original' number. How can I accomplish this in an efficient way?

When updating values in the same row, priority is given to home phone first and then work phone second. So if home == work == mobile, both work and mobile are updated to NaN. If home != work == mobile, then mobile is updated to NaN. When updating values in different rows, it does not matter which duplicate phone number is kept as the 'original' number. For example, if A['home'] == B['mobile'] == C['work'], two of those values should be set to NaN and the remaining one remain unchanged. I have chosen to keep the first number and set the other duplicate numbers to NaN when displaying the dataframe.

I've figured out how to update values within the same row using df.loc, but I've been unsuccessful in figuring out how I can also accomplish updating duplicate values to NaN across different rows and columns. How can I achieve this?

Below is further information on what I'm trying to do and where I'm getting stuck:

My initial dataframe looks something like this:

acct_id        home        work      mobile
      A  1111111111  1111111111  1111111111
      B  2222222222  2222222222  2222222222
      C  3333333333  3333333333  3333333333
      D  4444444444  5555555555  5555555555
      E  6666666666  7777777777  8888888888
      F  9999999999  9999999999  8888888888
      G  7777777777  6666666666  5555555555
      H  4444444444  3333333333  2222222222
      I         NaN         NaN         NaN

and my goal is to update the dataframe so that it looks like this:

acct_id        home        work      mobile
      A  1111111111         NaN         NaN
      B  2222222222         NaN         NaN
      C  3333333333         NaN         NaN
      D  4444444444  5555555555         NaN
      E  6666666666  7777777777  8888888888
      F  9999999999         NaN         NaN
      G         NaN         NaN         NaN
      H         NaN         NaN         NaN
      I         NaN         NaN         NaN

I'm currently approaching this as a 2 step problem. Step 1 is removing duplicate numbers in the same row. Step 2 is removing duplicate numbers across different rows and different columns. I have figured out step 1, using the df.loc command:

df.loc[df['home'] == df['work'], ['work']] = np.nan
df.loc[df['home'] == df['mobile'], ['mobile']] = np.nan
df.loc[df['work'] == df['mobile'], ['mobile']] = np.nan

This is what my dataframe looks like after running the above commands:

acct_no        home        work      mobile
      A  1111111111         NaN         NaN
      B  2222222222         NaN         NaN
      C  3333333333         NaN         NaN
      D  4444444444  5555555555         NaN
      E  6666666666  7777777777  8888888888
      F  9999999999         NaN  8888888888
      G  7777777777  6666666666  5555555555
      H  4444444444  3333333333  2222222222
      I         NaN         NaN         NaN

However, I can't wrap my head around step 2. As a brute force method, I have found that I can sort the dataframe on home and then loop through each row, checking if the previous row's home value is the same as the current row's, setting the current row's value to nan if it is the same. Lastly, I would have to repeat that process for the work and mobile keys. This is what the code for checking the home field would look like:

df.sort_values(by='home', inplace=True)
prev_row = {'home':None,'work':None,'mobile':None}
    for cur_idx,cur_row in df.iterrows():
        if prev_row['home'] == cur_row['home']:
            cur_row['home'] = np.nan
        prev_row = cur_row

After running the above code just for updating and checking on the home column, my dataframe will look like this:

acct_no        home        work      mobile
      A  1111111111         NaN         NaN
      B  2222222222         NaN         NaN
      C  3333333333         NaN         NaN
      D  4444444444  5555555555         NaN
      E         NaN  3333333333  2222222222
      F  6666666666  7777777777  8888888888
      G  7777777777  6666666666  5555555555
      H  9999999999         NaN  8888888888
      I         NaN         NaN         NaN

This solution is pretty hacky and not efficient for larger datasets, so how can I achieve this in a more efficient manner?

Any help is greatly appreciated -- thank you in advance!

CodePudding user response：

This might address your step 2 needs. If not, feel free to go with another approach.

df = pd.DataFrame(
    [
        dict(acct_no="D", home="4444444444", work="5555555555"),
        dict(acct_no="E", home=np.NaN, work="3333333333", mobile="2222222222"),
        dict(acct_no="J", home=np.NaN, work=np.NaN, mobile="8888888888"),
        dict(acct_no="K", home=np.NaN, work="8888888888"),
        dict(acct_no="L", home=np.NaN, work=np.NaN, mobile="8888888888"),
    ]
)
df["phone"] = (df.home
               .combine_first(df.work)
               .combine_first(df.mobile))
df = (df.sort_values(by="phone")
      .drop_duplicates(subset="phone")
      .set_index("acct_no"))
print(df)

output

               home        work      mobile       phone
acct_no                                                
E               NaN  3333333333  2222222222  3333333333
D        4444444444  5555555555         NaN  4444444444
J               NaN         NaN  8888888888  8888888888

In this implementation we are only looking at the phone column, which is the preferred number for an account. That might be a bit more draconian that desired. Notice for example that accounts "K" and "L" were nuked entirely, on the basis of sharing a phone number with "J". If multiple customers share a Home land line, that might not be the desired business logic. Notice also that if "K" were to add a Home number of 7878787878 he would survive, despite the 8888888888 dup. If Mobile is "more unique" than Home, perhaps we should prefer that number.

Now that we have used the phone column to good advantage, feel free to .drop() it.

The sort costs O(N log N), and everything else is linear, so this should be a performant solution, even for large datasets.

CodePudding user response：

In my opinion, the simplest would be to stack the numbers as Series, mask or drop_duplicates and then restore the original shape:

out = (df.set_index('acct_id').stack()
         # the magic happens here
         .mask(lambda d: d.duplicated())
         # restore original format
         .unstack().reset_index()
       )

Alternative:

out = (df.set_index('acct_id')
         .stack().drop_duplicates().unstack()
         .reindex(df['acct_id']).reset_index().reindex(columns=df.columns)
      )

Output:

  acct_id        home        work      mobile
0       A  1111111111         NaN         NaN
1       B  2222222222         NaN         NaN
2       C  3333333333         NaN         NaN
3       D  4444444444  5555555555         NaN
4       E  6666666666  7777777777  8888888888
5       F  9999999999         NaN         NaN
6       G         NaN         NaN         NaN
7       H         NaN         NaN         NaN
8       I         NaN         NaN         NaN

order: columns first

If you want, you can easily tweak the above to give a column-first preference in the way to chose the duplicates to keep:

out = (df.set_index('acct_id').unstack()
         .mask(lambda d: d.duplicated())
         .swaplevel().unstack().reset_index() # or: .unstack().T.reset_index()
      )

Output:

  acct_id        home        work      mobile
0       A  1111111111         NaN         NaN
1       B  2222222222         NaN         NaN
2       C  3333333333         NaN         NaN
3       D  4444444444  5555555555         NaN
4       E  6666666666         NaN  8888888888
5       F  9999999999         NaN         NaN
6       G  7777777777         NaN         NaN
7       H         NaN         NaN         NaN
8       I         NaN         NaN         NaN