How to replace column values based on other columns in pandas?-CodePudding

Assume, I have a data frame such as

import pandas as pd 
df = pd.DataFrame({'visitor':['A','B','C','D','E'],
                   'col1':[1,2,3,4,5],
                   'col2':[1,2,4,7,8],
                   'col3':[4,2,3,6,1]})

visitor	col1	col2	col3
A	1	1	4
B	2	2	2
C	3	4	3
D	4	7	6
E	5	8	1

For each row/visitor, (1) First, if there are any identical values, I would like to keep the 1st value of each row then replace the rest of identical values in the same row with NULL such as

visitor	col1	col2	col3
A	1	NULL	4
B	2	NULL	NULL
C	3	4	NULL
D	4	7	6
E	5	8	1

Then (2) keep rows/visitors with more than 1 value such as
Final Data Frame

visitor	col1	col2	col3
A	1	NULL	4
C	3	4	NULL
D	4	7	6
E	5	8	1

Any suggestions? many thanks

CodePudding user response：

Let us try mask with pd.Series.duplicated, then dropna with thresh

out = df.mask(df.apply(pd.Series.duplicated,1)).dropna(thresh = df.shape[1]-1)
Out[321]: 
  visitor  col1  col2  col3
0       A     1   NaN   4.0
2       C     3   4.0   NaN
3       D     4   7.0   6.0
4       E     5   8.0   1.0

CodePudding user response：

We can use series.duplicated along the columns axis to identify the duplicates, then mask the duplicates using where and filter the rows where the sum of non-duplicated values is greater than 1

s = df.set_index('visitor')
m = ~s.apply(pd.Series.duplicated, axis=1)

s.where(m)[m.sum(1).gt(1)]

         col1  col2  col3
visitor                  
A           1   NaN   4.0
C           3   4.0   NaN
D           4   7.0   6.0
E           5   8.0   1.0