I have a dataframe which looks something like this:
index | c1 | c2 | c3 | m1 | m2 |
---|---|---|---|---|---|
1 | a | b | c | a | c |
2 | d | e | f | d | a |
3 | a | b | c | d | e |
For each row the values in c1, c2, c3 will never be duplicates of each other, same with m1 & m2 My aim is for each row to only show each value once, i.e if a values in m1 or m2 already exists in c1, c2 or c3 it should be replaced with NaN.
So for the example I gave, the result would be:
index | c1 | c2 | c3 | m1 | m2 |
---|---|---|---|---|---|
1 | a | b | c | NaN | NaN |
2 | d | e | f | NaN | a |
3 | a | b | c | d | e |
How would this best be solved? Thanks
CodePudding user response:
You can use numpy to perform broadcasting comparison:
# get "c" columns (you can use another method)
# and convert to numpy array
c = df.filter(regex='^c').to_numpy()
# get "m" columns (you can use another method)
m = df.filter(regex='^m')
# mask values in "m" that are also in "c" in any position
mask = (m.to_numpy()[...,None] == c[:,None,:]).any(-1)
# update dataframe
df[m.columns] = m.mask(mask)
output:
index c1 c2 c3 m1 m2
0 1 a b c NaN NaN
1 2 d e f NaN a
2 3 a b c d e
intermediate mask
:
array([[ True, True],
[ True, False],
[False, False]])
CodePudding user response:
presuming you want a np.nan
just stack to get a series and then then unstack and replace with the value you want.
df.mask(df.stack().duplicated().unstack(), np.nan)
This would be dependent on the order of your columns and with different sets you could just copy out the columns into another dataframe