I have the table below and would like to apply onde command to compare and eliminate duplicate values in row n
and n 1
in multiple dataframes (df1, df2)
.
Comand sugestion: .diff().ne(0)
How to apply this command only to the elements of columns A ,C and D, using the commands def
,lambda
or apply
?
df1:
A | B |
---|---|
22 | 33 |
22 | 4 |
3 | 55 |
1 | 55 |
df2:
C | D |
---|---|
5 | 2.3 |
45 | 33 |
7 | 33 |
7 | 11 |
The expected output is:
df1:
A | B |
---|---|
22 | 33 |
NaN | 4 |
3 | 55 |
1 | 55 |
df2:
C | D |
---|---|
5 | 2.3 |
45 | 33 |
7 | NaN |
NaN | 11 |
The other desired option would be to delete the duplicated lines, keeping the first number.
df1:
A | B |
---|---|
22 | 33 |
row deleted | row deleted |
3 | 55 |
row deleted | row deleted |
df2:
C | D |
---|---|
5 | 2.3 |
45 | 33 |
row deleted | row deleted |
row deleted | row deleted |
CodePudding user response:
Based on this answer, you can create a mask for a single column in your dataframe (here for example for column A
) with
mask1 = df['A'].shift() == df['A']
Since this shows True
if there was a duplicate, you need to slice the DataFrame with the negation of the mask
df = df[~mask1]
To do this for multiple columns, make a mask for each column and use NumPy's logical_or
to combine the masks. Then slice df
with the final mask.
CodePudding user response:
With your suggested command: .diff().ne(0)
(or .diff.eq(0)
)
Option 1: set NaN
to duplicate values
# For 1 column
df1.loc[df1['A'].diff().eq(0), 'A'] = np.NaN
print(df1)
A B
0 22.0 33
1 NaN 4
2 3.0 55
3 1.0 55
# For multiple columns
df2 = df2.apply(lambda x: x[x.diff().ne(0)])
print(df2)
C D
0 5.0 2.3
1 45.0 33.0
2 7.0 NaN
3 NaN 11.0
Option 2: delete rows
>>> df1[df1.diff().ne(0).all(axis=1)]
A B
0 22 33
2 3 55
>>> df2[df2.diff().ne(0).all(axis=1)]
C D
0 5 2.3
1 45 33.0