Home > Blockchain >  Replace duplicate value in dataframe row with NaN
Replace duplicate value in dataframe row with NaN

Time:08-19

I have a dataframe which looks something like this:

index c1 c2 c3 m1 m2
1 a b c a c
2 d e f d a
3 a b c d e

For each row the values in c1, c2, c3 will never be duplicates of each other, same with m1 & m2 My aim is for each row to only show each value once, i.e if a values in m1 or m2 already exists in c1, c2 or c3 it should be replaced with NaN.

So for the example I gave, the result would be:

index c1 c2 c3 m1 m2
1 a b c NaN NaN
2 d e f NaN a
3 a b c d e

How would this best be solved? Thanks

CodePudding user response:

You can use numpy to perform broadcasting comparison:

# get "c" columns (you can use another method)
# and convert to numpy array
c = df.filter(regex='^c').to_numpy()
# get "m" columns (you can use another method)
m = df.filter(regex='^m')

# mask values in "m" that are also in "c" in any position
mask = (m.to_numpy()[...,None] == c[:,None,:]).any(-1)

# update dataframe
df[m.columns] = m.mask(mask)

output:

   index c1 c2 c3   m1   m2
0      1  a  b  c  NaN  NaN
1      2  d  e  f  NaN    a
2      3  a  b  c    d    e

intermediate mask:

array([[ True,  True],
       [ True, False],
       [False, False]])

CodePudding user response:

presuming you want a np.nan just stack to get a series and then then unstack and replace with the value you want.

df.mask(df.stack().duplicated().unstack(), np.nan)

This would be dependent on the order of your columns and with different sets you could just copy out the columns into another dataframe

  • Related