Assume, I have a data frame such as
import pandas as pd
df = pd.DataFrame({'visitor':['A','B','C','D','E'],
'col1':[1,2,3,4,5],
'col2':[1,2,4,7,8],
'col3':[4,2,3,6,1]})
visitor | col1 | col2 | col3 |
---|---|---|---|
A | 1 | 1 | 4 |
B | 2 | 2 | 2 |
C | 3 | 4 | 3 |
D | 4 | 7 | 6 |
E | 5 | 8 | 1 |
For each row/visitor, (1) First, if there are any identical values, I would like to keep the 1st value of each row then replace the rest of identical values in the same row with NULL such as
visitor | col1 | col2 | col3 |
---|---|---|---|
A | 1 | NULL | 4 |
B | 2 | NULL | NULL |
C | 3 | 4 | NULL |
D | 4 | 7 | 6 |
E | 5 | 8 | 1 |
Then (2) keep rows/visitors with more than 1 value such as
Final Data Frame
visitor | col1 | col2 | col3 |
---|---|---|---|
A | 1 | NULL | 4 |
C | 3 | 4 | NULL |
D | 4 | 7 | 6 |
E | 5 | 8 | 1 |
Any suggestions? many thanks
CodePudding user response:
Let us try mask
with pd.Series.duplicated
, then dropna
with thresh
out = df.mask(df.apply(pd.Series.duplicated,1)).dropna(thresh = df.shape[1]-1)
Out[321]:
visitor col1 col2 col3
0 A 1 NaN 4.0
2 C 3 4.0 NaN
3 D 4 7.0 6.0
4 E 5 8.0 1.0
CodePudding user response:
We can use series.duplicated
along the columns axis to identify the duplicates, then mask the duplicates using where
and filter the rows where the sum of non-duplicated values is greater than 1
s = df.set_index('visitor')
m = ~s.apply(pd.Series.duplicated, axis=1)
s.where(m)[m.sum(1).gt(1)]
col1 col2 col3
visitor
A 1 NaN 4.0
C 3 4.0 NaN
D 4 7.0 6.0
E 5 8.0 1.0