Home > Software design >  How to replace column values based on other columns in pandas?
How to replace column values based on other columns in pandas?

Time:02-24

Assume, I have a data frame such as

import pandas as pd 
df = pd.DataFrame({'visitor':['A','B','C','D','E'],
                   'col1':[1,2,3,4,5],
                   'col2':[1,2,4,7,8],
                   'col3':[4,2,3,6,1]})
visitor col1 col2 col3
A 1 1 4
B 2 2 2
C 3 4 3
D 4 7 6
E 5 8 1

For each row/visitor, (1) First, if there are any identical values, I would like to keep the 1st value of each row then replace the rest of identical values in the same row with NULL such as

visitor col1 col2 col3
A 1 NULL 4
B 2 NULL NULL
C 3 4 NULL
D 4 7 6
E 5 8 1

Then (2) keep rows/visitors with more than 1 value such as
Final Data Frame

visitor col1 col2 col3
A 1 NULL 4
C 3 4 NULL
D 4 7 6
E 5 8 1

Any suggestions? many thanks

CodePudding user response:

Let us try mask with pd.Series.duplicated, then dropna with thresh

out = df.mask(df.apply(pd.Series.duplicated,1)).dropna(thresh = df.shape[1]-1)
Out[321]: 
  visitor  col1  col2  col3
0       A     1   NaN   4.0
2       C     3   4.0   NaN
3       D     4   7.0   6.0
4       E     5   8.0   1.0

CodePudding user response:

We can use series.duplicated along the columns axis to identify the duplicates, then mask the duplicates using where and filter the rows where the sum of non-duplicated values is greater than 1

s = df.set_index('visitor')
m = ~s.apply(pd.Series.duplicated, axis=1)

s.where(m)[m.sum(1).gt(1)]

         col1  col2  col3
visitor                  
A           1   NaN   4.0
C           3   4.0   NaN
D           4   7.0   6.0
E           5   8.0   1.0
  • Related